DOCUMENT RESUME

AUTHOR        McGill, Michael
TITLE         An Evaluation of Factors Affecting Document Ranking
              by Information Retrieval Systems.
INSTITUTION   Syracuse Univ., N.Y. School of Information Studies.
SPONS AGENCY  National Science Foundation, Washington, D.C.
PUB DATE      Oct 79
GRANT         NSF-IST-78-10454
EDRS PRICE    MF01/PC06 Plus Postage.
DESCRIPTORS   *Algorithms; *Comparative Analysis; Data Bases; Data
              Processing; Information Needs; *Information
              Retrieval; Search Strategies; *User Satisfaction
              (Information)
IDENTIFIERS   *Boolean Algebra; *Document Ranking

ABSTRACT

A study of ranking algorithms used in a Boolean environment is
based on an evaluation of factors affecting document ranking by
information retrieval systems. The algorithms were decomposed into
term weighting schemes and similarity measures, representatively
selected from those known to exist in information retrieval
environments, before being tested on documents submitted by
specific clearinghouses to the CIJE data base. Searches were
conducted using information need statements from individuals with
interests congruent to the data base, and documents retrieved by
professional searchers were judged for relevance by those
submitting the need statements. This input was analyzed according
to the ability of the algorithms to move relevant documents toward
the beginning of the output list using the coefficient of ranking
effectiveness (CRE). While it is possible to significantly improve
the order of the output using either controlled vocabulary or free
text, the ranking is at best about 20 percent effective at this
time. It is suggested that the factors currently used in ranking
algorithms are not likely to make ranking closer to 100 percent
effective. A bibliography of 59 references is included. (Author)






AN EVALUATION OF FACTORS
AFFECTING DOCUMENT RANKING BY
INFORMATION RETRIEVAL SYSTEMS

October 1979



Principal Investigator
Michael McGill



Research Associates
Matthew Koll
Terry Noreault



This material is based upon research supported
by the National Science Foundation, Division of
Information Science and Technology, under Grant
NSF-IST-78-10454. The opinions, findings, and
conclusions or recommendations expressed in
this report are those of the authors and do
not necessarily reflect the views of the
National Science Foundation.



School of Information Studies
Syracuse University
Syracuse, New York 13210

"PERMISSION TO REPRODUCE THIS MATERIAL HAS BEEN GRANTED BY
Michael J. McGill TO THE EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)."



ABSTRACT 

This is a report of a study of ranking algorithms used in
a Boolean environment. The ranking algorithms are decomposed
into term weighting schemes and similarity measures. Represen-
tative term weights and similarity measures are selected from
those known to exist in information retrieval environments.
The ranking algorithms are tested using documents submitted by
specific clearinghouses to the Current Index to Journals in
Education data base.

The study used information need statements from individuals
with interests congruent with the data base. After searches
were conducted by professional searchers, the retrieved documents
were judged for relevance by the persons submitting the original
information need statement. This provided the input to study
the ranking algorithms.

The algorithms were analyzed according to their ability to
move relevant documents toward the beginning of the output list.
The coefficient of ranking effectiveness (CRE) was used to
measure this ability. The study found that when using a controlled
vocabulary or the free text, it is possible to significantly
improve the order of the output. The results also indicate
that ranking is at best about 20% effective with the remaining
80% not yet resolved. It is suggested that the factors
currently used in ranking algorithms are not likely to make
ranking closer to 100% effective. Rather, new information is
likely to be required.



ACKNOWLEDGEMENTS 

This report constitutes the final report for Grant
NSF-IST-78-10454, entitled AN EVALUATION OF FACTORS
AFFECTING DOCUMENT RANKING BY INFORMATION RETRIEVAL SYSTEMS.
The project was supported by the Division of Information
Science and Technology of the National Science Foundation.
The grant period ran from September 1, 1978 to February 29,
1980.

The study required the resources of many individuals
including the individuals providing information need state-
ments and relevance judgements. Peggy Montgomery was a prime
coordinator, counselor, and typist for the project. Chris
Fox's simulation study is a valuable contribution. The
analysis was made possible through the guidance of Jeffrey
Katzer, with assistance from Rich Veith and Chris Fox. The
final report was edited by Cheryl McAfee. The able advice of
Jennifer Kuehn, Gerard Salton and Karen Sparck Jones also
significantly contributed to this study.



TABLE OF CONTENTS

Introduction
Components and Environment of Ranking Algorithms
    a. Form of Document Representation (DR)
    b. Term Weighting in the Document Representation
    c. Form of the Query (QF)
    d. Term Weighting in the Query (QW)
    e. Similarity Measure (SM)
Review of Relevant Literature
Approach and Methodology
Restrictions
Methodological Requirements
Objectives of the Study
Procedure
Review and Selection of TWs
Review and Selection of SMs
Description and Loading of the Data Base
    a. Introduction
    b. Description of Data Base
    c. Construction of the Inverted Files
    d. Comparison of Construction Times
Collecting Interest Statements
Intermediaries and Searching
Query Processing
    a. Lost Responses
Measuring the Effectiveness of Ranking
Overview of Results
Results Using the Controlled Representation
Results Using the Free Representation
Efficiency Considerations
Cost of TWs
    a. Category 1
    b. Category 2
    c. Category 3
Cost of SMs
Conclusion
References
Appendix A - Commands for SIRE
Appendix B - Forms



TABLES

Table 1  - Term Weightings
Table 2  - Similarity Measures
Table 3  - Unique SMs After Binary Simulation
Table 4  - Records in Data Base By Clearinghouse
Table 5  - Description of Data Base by Representation
Table 6  - Processing Time for File Construction
Table 7  - Comparison of File Sizes
Table 8  - Summary of Search Characteristics
Table 9  - Overlap Percentages
Table 10 - Matrix of CRE Values, Controlled Representation
Table 11 - Analysis of Variance Results, Term Weighting Scheme,
           Controlled Representation
Table 12 - Analysis of Variance Results, Similarity Measure,
           Controlled Representation
Table 13 - Significant Differences Between CRE Means for
           Similarity Measures, Controlled Representation
Table 14 - Matrix of CRE Values, Free Representation
Table 15 - Analysis of Variance Results, Term Weighting Schemes,
           Free Representation
Table 16 - Analysis of Variance Results, Similarity Measures,
           Free Representation
Table 17 - Significant Differences Between CRE Means for
           Similarity Measures, Free Representation



FIGURES

Figure 1 - Ranking Algorithm Model
Figure 2 - Inverted File
Figure 3 - Performance Measure Difficulty
Figure 4 - Correlations Between SMs Based on
           Binary Simulation of 15 Queries
Figure 5 - Output from READIN
Figure 6 - Output from MERGE/SORT
Figure 7 - Output from MAKDIC
Figure 8 - Calculation of the Coefficient
           of Ranking Effectiveness



INTRODUCTION

Ranking the output of an information system on the
criterion of probable relevance to the query is considered an
optimal strategy in information retrieval (Maron and Kuhns,
1960; s^sbharft, 1975; Lancaster and Fayen, 1973). Ranked output
attempts to provide the user with information indicating that
the closer a document is to the beginning of the output list,
the more likely it is to be relevant to his query.

In the context of a document retrieval system,¹ a ranking
algorithm defines an ordering on a set of documents, ordering
the documents according to their degree of similarity to a
query. In the simplest case (a binary decision rule) the
document set is ordered into two classes: those satisfying the
retrieval criteria and those not satisfying the criteria. More
complicated ranking algorithms may be defined in order to provide
orderings of greater detail, creating up to N classes of output,
where N is the number of documents retrieved.
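
The difference between a binary decision rule and a finer ordering can be sketched in a few lines of Python. This is only an illustration: the document ids, the scores, and the helper name `rank_classes` are hypothetical and do not come from the study.

```python
from collections import defaultdict

def rank_classes(scores):
    """Group document ids into classes of equal score, best class first."""
    classes = defaultdict(list)
    for doc_id, score in scores.items():
        classes[score].append(doc_id)
    return [sorted(classes[s]) for s in sorted(classes, reverse=True)]

# Binary decision rule: 1 = satisfies the retrieval criteria, 0 = does not.
binary_scores = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}
# A finer-grained (hypothetical) similarity score for the same documents.
graded_scores = {"d1": 0.9, "d2": 0.1, "d3": 0.4, "d4": 0.2}

binary_classes = rank_classes(binary_scores)  # two classes of output
graded_classes = rank_classes(graded_scores)  # one class per document
```

With the binary rule the four documents fall into two classes; with distinct graded scores each document forms its own class, which is the N-class limit described above.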

1 The phrase "document retrieval" will be used although
  "computerized reference retrieval system" is a more
  appropriate description of this type of system.

The absence of a systematic collection, classification,
and comparison of ranking algorithms was noted by Sager and
Lockeman (Sager and Lockeman, 1976). Subsequently they began
the task of systematically exploring ranking algorithms. They
identified components which could theoretically be combined to
form 990 different ranking algorithms; unfortunately only 14 of
these algorithms could actually be tested given their experi-
mental conditions. Their results were constrained by the fact
that the ranking algorithms were testable in only one retrieval
environment (defined below) and they encountered other difficulties,
such as problems with relevance judgments (Sager and Lockeman,
1976, p. 24). Thus, work on the careful examination of ranking
algorithms was begun, but much theoretical and empirical work
remained.

Ranked output has been a process of unknown effectiveness,
requiring heavy user effort or occurring only in the context of
SMART-like systems. In the SMART-like systems, the ranking
process cannot be isolated from the retrieval process for a
particular investigation. Inverted file systems, on the other
hand, traditionally keep information which only allows simple
ranking and often requires a great deal of user effort.

Innovations in the Syracuse Information Retrieval System
(SIRE) (McGill et al., 1976) allow numerous and sophisticated
ranking methods to be studied in an inverted file context.
Specific components of ranking algorithms and the retrieval
environment can be isolated and simultaneously varied so that the
effects of each component and each combination of components can
be observed.

This is a report of a study which examined 504 different
ranking algorithms in two different environments. The study was
conducted September 1, 1978 through August 31, 1979. The
report considers the environment and the components of ranking
algorithms, the relevant literature, the methods used in this
study and the results and implications of the collected data.



COMPONENTS AND ENVIRONMENT OF RANKING ALGORITHMS

A fundamental model of a document retrieval system is
shown in Figure 1. This model is not new, but it provides an
essential framework for the understanding of ranking algorithms.
There are many other models which view the information retrieval
process from other perspectives with different components (e.g.
Saracevic, 1968, p. xii). The model in Figure 1 is different in
that the components and place of ranking algorithms in a retrieval
system are clearly included. This model clarifies the relation-
ships between certain processes. For example, it shows the
complementary and analogous roles of defining a query for an
information need and of defining the set of descriptors
for a document. From this model it is clear that the schemes
for applying different weights to terms in the query or document
representations can be isolated from each other and from the
formation of the representations. Further, the model clarifies
the isolation of the similarity measure from the weighting and
descriptions.

FIGURE 1 - RANKING ALGORITHM MODEL (boxes: Documents, Document
representation formation, Document representations, Term
weighting, Query, Query weighting, Similarity measure)

The five key elements comprising a ranking algorithm and its
environment are:

a) FORM OF DOCUMENT REPRESENTATION (DR). Form of document
representation refers to the manner of selecting the descriptors
(index terms) by which the document will be examined by an
algorithm to determine whether to retrieve or not to retrieve
the document. Indexing language variables constitute a large
portion of Document Representation variability.

A document can be represented by any combination of
classification codes or index terms assigned manually or auto-
matically from a controlled or uncontrolled vocabulary. Terms
may be extracted from the document or portions of the document
(e.g. title, author, abstract, citations). There are also
variations in the form, structure, depth and breadth of
indexing languages.

The product of the document representation process, for
any document representation and any document j, can be thought
of as a document vector

$$D_j = (a_{1j}, a_{2j}, a_{3j}, \ldots, a_{mj})$$

where

$$a_{ij} = \begin{cases} 0 & \text{if document } j \text{ is not indexed by term } i \\ 1 & \text{if document } j \text{ is indexed by term } i \end{cases}$$

and $m$ is the number of index terms in the vocabulary, with
$1 \le i \le m$.



FIGURE 2 - INVERTED FILE (a term-by-document matrix: rows are
terms such as Apex, Apple, Baker, Bun, ..., term M; columns are
documents Doc. 1, Doc. 2, ...; each entry is 0 or 1)


The vector terminology (e.g. "document vector") is
applicable to inverted file systems as well as to SMART systems.
In Figure 2 a row is referred to as a "term vector" - noting the
presence or absence of a particular term across all documents.
A column is a "document vector" - noting the presence or absence
of each term in a particular document. The process of indexing
a document may thus be described as the creation of a document
vector representing that particular information item.
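
The row and column views of such a matrix can be illustrated with a small Python sketch. The terms, documents, and 0/1 entries below are toy data chosen for illustration, not the contents of Figure 2, and the function names are hypothetical.

```python
# Rows are terms, columns are documents (hypothetical toy data).
terms = ["apex", "apple", "baker"]
docs = ["doc1", "doc2", "doc3"]
incidence = [
    [1, 0, 0],  # apex
    [0, 1, 1],  # apple
    [1, 0, 1],  # baker
]

def term_vector(term):
    """Row view: presence or absence of one term across all documents."""
    return incidence[terms.index(term)]

def document_vector(doc):
    """Column view: presence or absence of every term in one document."""
    j = docs.index(doc)
    return [row[j] for row in incidence]
```

An inverted file stores the row view, so obtaining a document vector means collecting one entry from every term's row.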

b) TERM WEIGHTING IN THE DOCUMENT REPRESENTATION (TW).
Term weighting schemes determine how much emphasis is placed on
the occurrence(s) of each index term. Sager and Lockeman
identified 22 such schemes (Sager and Lockeman, 1976). The
elementary weighting scheme is, of course, "unweighted". This
scheme assigns a 1 or 0 for the presence or absence of a term,
respectively. More complex schemes may count the number of
occurrences of the term in the document, normalized by such
factors as the number of terms used to represent the document,
the frequency with which a term occurs throughout the data base,
the overall frequency distribution and probabilities of the term
occurrences, or the term's pattern of co-occurrence with other
terms. The term weights may be described by a vector of co-
efficients $(w_{1j}, w_{2j}, \ldots, w_{mj})$ corresponding to the
document representation vector. A weighted document representation
vector $(w_{1j}a_{1j}, w_{2j}a_{2j}, w_{3j}a_{3j}, \ldots, w_{mj}a_{mj})$
is developed by the element by element multiplication of the two
vectors. Additionally, the elements of the document representation
vector do not necessarily have to represent index terms themselves,
but could be underlying factors discovered by analysis of the text
of the collection (see Switzer, 1965) or other selected attributes
(see Cleveland, 1976).
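
As a rough sketch of the element by element multiplication, the Python below builds three weight vectors and applies them to a binary document vector. The function names are hypothetical, and the particular formulas (count divided by document length, and log(N/df) for collection frequency) are common textbook forms assumed here for illustration; they are not claimed to be among the schemes selected in this study.

```python
import math

def weighted_vector(weights, incidence):
    """Element by element product of a weight vector and a document vector."""
    return [w * a for w, a in zip(weights, incidence)]

def unweighted(incidence):
    """The elementary scheme: every term gets weight 1."""
    return [1.0] * len(incidence)

def length_normalized_tf(term_counts):
    """Occurrence counts normalized by the total number of term occurrences."""
    total = sum(term_counts)
    return [c / total if total else 0.0 for c in term_counts]

def idf(doc_freqs, n_docs):
    """Assumed collection-frequency weight log(N / df); 0 for unused terms."""
    return [math.log(n_docs / df) if df else 0.0 for df in doc_freqs]

doc_vector = [1, 0, 1]                   # binary document vector a_ij
tf = length_normalized_tf([3, 0, 1])     # the term counts are toy values
weighted = weighted_vector(tf, doc_vector)
```

Because the weights are kept in a separate vector, a weighting scheme can be swapped without touching the document representation, which is the isolation the model in Figure 1 describes.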

c) FORM OF THE QUERY (QF). Analogous to the conversion
of a document to a document representation, an information need
must be converted into a query. Queries are categorized here
as belonging to one of two forms, Boolean or "natural" language.²
Naturally, the query formation process results in a request
expressed in the same language as the document representation.
Thus, there are theoretically two query forms (Boolean and
natural language) for each document representation form. In
either case the query can be represented by a vector corresponding
to the document representation vector (i.e. having the column
vectors represent the same terms, factors or attributes). Other
factors which may influence the query formation include the use/
non-use of an intermediary, the form of the man-machine dialog,
relevance feedback techniques, thesauri, adjacency operators and
generality/specificity of the query.

2 Natural language queries may be considered as identical to a
  Boolean query consisting of the same set of terms, all
  connected by ORs with some subsequent processing (McGill
  et al., 1976).

d) TERM WEIGHTING IN THE QUERY (QW). Weighting co-
efficients may be assigned to query vector elements as they are
to document representation vectors. Query terms can be weighted
equally; according to their frequency of occurrence in the query;
manually, according to the searcher's perceptions of the importance
of each term; or, in situations using relevance feedback, as a
function of the term's pattern of occurrences in relevant and
non-relevant documents (Yu and Salton, 1977).

e) SIMILARITY MEASURE (SM). A similarity measure is an
algorithm which computes the degree of agreement between entities.
For this study, the concern is with a query vector and document
representation vectors. There are many vector association
measures described in the literature (see for example, van
Rijsbergen, 1975, 31-34; Reitsma, 1968). One simple measure
yields a 1 if $\sum_{i=1}^{m} Q_i D_{ij} \neq 0$ and 0 if
$\sum_{i=1}^{m} Q_i D_{ij} = 0$, where Q and D
are the query and document representation vectors and m is the
number of terms in the vocabulary. More complex measures may
take into account terms not present in either the query or
document representation vector in addition to number of terms,
frequency and probability data.
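
A minimal Python sketch of two vector measures follows. `simple_match` implements the 0/1 rule just stated; `cosine` is the standard cosine correlation, included as an assumed example of a more graded measure. Both function names are hypothetical.

```python
import math

def simple_match(q, d):
    """1 if the query and document share at least one weighted term, else 0."""
    return 1 if sum(qi * di for qi, di in zip(q, d)) != 0 else 0

def cosine(q, d):
    """Cosine correlation: inner product divided by the vector lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = [1, 0, 1]
doc_a = [1, 1, 0]   # shares one term with the query
doc_b = [0, 1, 0]   # shares no term with the query
```

The simple measure can only split documents into two classes, while a graded measure such as the cosine can separate documents that all share some term with the query.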

Ranking algorithms are composed of the three units in
Figure 1: QW, TW and SM. The two units DR and QF constitute the
environment of a ranking algorithm. The object of a ranking
algorithm is to predict the relevance of each document and place
the documents in descending order of predicted relevance. Thus,
as the output list is read from beginning to end, each document is
more likely to be relevant than those following it.

Ranking algorithms do not alter the composition of the set
of documents retrieved by a query. That is, in a given environ-
ment (QF and DR) and a given data base, a document retrieval
system will produce the identical set of documents in response to
a query regardless of the ranking algorithm employed, provided
that a cutoff value on the similarity score is not being used to
restrict the size of the output list. Conceptually, ranking
algorithms work after the retrieved set is formed to affect the
order in which the documents are displayed.

This is precisely the way the two-step retrieval process
has been implemented in the Syracuse Information Retrieval
Experiment (SIRE) (McGill et al., 1976; Noreault et al., 1977).
First the retrieved set is identified as those documents which
satisfy the Boolean logic of the query. Then the ranking
algorithm is employed to compare the similarity of the
document representation vectors of the retrieved documents to
the query vector. The documents are then rank ordered for
output. In a recent study, the efficiency and effectiveness
of this method were demonstrated (Noreault et al., 1977).
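
The two steps can be sketched in Python as a filter followed by a sort. This is an illustrative sketch only, not SIRE's implementation: the documents, the query, the `overlap` score, and all function names are hypothetical.

```python
def two_step_retrieve(docs, boolean_query, query_terms, similarity):
    """Step 1: keep documents satisfying the Boolean logic.
    Step 2: order the retrieved set by similarity to the query."""
    retrieved = [doc_id for doc_id, terms in docs.items() if boolean_query(terms)]
    return sorted(
        retrieved,
        key=lambda doc_id: similarity(query_terms, docs[doc_id]),
        reverse=True,
    )

def overlap(query_terms, doc_terms):
    """A toy similarity score: number of query terms present in the document."""
    return len(query_terms & doc_terms)

docs = {
    "d1": {"ranking", "boolean"},
    "d2": {"ranking", "retrieval", "boolean"},
    "d3": {"indexing"},
    "d4": {"retrieval"},
}
query_terms = {"ranking", "retrieval", "boolean"}

def boolean_query(terms):
    """The Boolean statement: ranking OR retrieval."""
    return "ranking" in terms or "retrieval" in terms

ranked = two_step_retrieve(docs, boolean_query, query_terms, overlap)
```

Swapping `overlap` for any other similarity function changes only the order of `ranked`, never its membership, as long as no cutoff is applied.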

This process is in contrast to linear associative processing
retrieval systems (e.g. unclustered SMART) (Sparck Jones, 1973).
In these systems, as in the case of two-step systems using a
cutoff value, the nature of the ranking algorithms can affect
the set of documents the user receives. Using a cutoff value
places greater importance on the role of the ranking algorithm.

REVIEW OF RELEVANT LITERATURE

While there have been numerous evaluations of document
retrieval systems and different aspects of document retrieval
systems, Sager and Lockeman's (1976) view that a systematic
evaluation (or even conceptual organization) of ranking
algorithms has been absent from the literature is confirmed.

There are methodological reasons why definite statements
about the relative performances of ranking algorithms were not
made. These will be discussed below. However, a significant
reason for the lack of knowledge is theoretical. That is,
until Sager and Lockeman's explication, the concept of ranking
algorithms was not well enough defined to be carefully examined.

It is easy to be critical of the methodology in information
retrieval research. Without going through a case by case
examination of past evaluation studies, some recurrent problems
can be pointed out. First is the problem of the small size of data
bases usually used in this research. This inhibits the
generalizability of results because the queries used in these
studies are often not representative of queries that would be made
of a larger data base. For example, consider a data base of 1,000
documents, and a query which retrieves 30 of those documents. If
the data base is a representative sample of a larger data base,
with say 30,000 documents, then that same query passed against the
larger data base should be expected to retrieve about 900
documents, not the kind of query often used in operational
settings. Another problem is that of human variables confounding
system variables. Examples are poor indexer reliability, searcher
inconsistency, and poor agreement between and among user and
expert relevance judges.

For example, SUPARS researchers concluded "that a large
portion of the variance in system performance may be due to factors
extrinsic to the system - the manner in which documents are defined
as relevant and individual differences among searchers" (Katzer,
1971, pp. 38-39). The Comparative Systems Laboratory group
concluded that

     ... The difference in retrieval as exists
     between languages of equivalent length, can
     almost entirely be attributed to human
     decisions in indexing and question analysis.
     In the study of differences in retrieval
     of relevant answers, where the relevant
     answers retrieved by index file C and
     missed by file ... were examined, it was
     found that at least 75% of the incidence
     of missing can be attributed to the human
     factor - to human decisions, idiosyncrasies,
     inconsistencies, interpretations, etc.
     (Saracevic, 1968, p. 130).

Keen (1973) reported 42% inter-indexer consistency,
77% intra-indexer consistency (after 20 weeks), 32% agreement
among relevance judges and that 69% of the documents judged
relevant by a requestor were also judged relevant by expert judges.

Other methodological reasons for the lack of success of
evaluations to explain ranking algorithms are that 1) an examination
of ranking algorithms has not been the main goal of any empirical
research, besides Sager and Lockeman's restricted effort, 2) in
some cases the system variables have not been isolated or controlled
so as to determine if there are effects due to specific system
components (this may be due both to the experimental design and/
or the nature of the dependent variables, i.e. the performance
measures), and 3) variables contributing to ranking algorithm
performance have not been considered in broad contexts. That is,
these variables must be considered at different levels of component
and environmental variables.

Evaluations or comparisons of total systems are too general
to allow conclusions to be drawn about specific system components.
This is particularly true for studies of operational systems, but
true for experimental systems also. For example, in the original
SMART vs. MEDLARS comparison (Salton, 1969), while manual and
automatic indexing were the focus of the comparison, other factors
such as the form of the query, term weighting schemes and similarity
measures may have had some effect on the results.

Recall, precision and fallout measures may be suitable as
descriptive measures of a system's overall performance, but they
are inadequate for investigations of the effects of specific
system components. These measures are sensitive to variance
in many system components, and it is only with the greatest
experimental control that these measures can give testimony to
the performance of a system component. Retrieval methods A and B
could have identical recall-precision graphs yet be retrieving
entirely different documents. The usual performance measures
would not convey that information, which is of value to a system ...



FIGURE 3 - PERFORMANCE MEASURE DIFFICULTY (diagram of All
Documents, divided into Relevant Documents and Non-relevant
Documents, with the sets Retrieved by System A and Retrieved by
System B)
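
The situation in Figure 3 can be reproduced with a toy Python example in which two systems score identically on recall and precision while retrieving disjoint sets. The document ids and set contents are invented for illustration, and a single recall and precision point stands in for a full recall-precision graph.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one retrieved set."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

relevant = {"d1", "d2", "d3", "d4"}
system_a = {"d1", "d2", "d5", "d6"}   # hypothetical output of System A
system_b = {"d3", "d4", "d7", "d8"}   # hypothetical output of System B

score_a = recall_precision(system_a, relevant)
score_b = recall_precision(system_b, relevant)
shared = system_a & system_b
```

Both systems reach a recall of 0.5 and a precision of 0.5, yet `shared` is empty, so the two scores say nothing about which documents each system actually found.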



A further problem is that of restricted range
of investigation. Forms of document representation, index term
weighting, similarity measures and query modifications have all
been tested; however, in most instances, the other
variables were not sufficiently considered (cf. Reitsma
and Sagalyn, 1968; Minker et al., 1972; Cleverdon and Keen, 1966;
Keen, 1973; Sparck Jones, 1973; Salton, 1975). This does not
imply that such studies have not achieved results which bear on
the performance of ranking algorithms. Studies of index term
weighting (e.g. Salton and Yang, 1973; reviewed by Sparck Jones,
1973) show ambiguous results but indicate that inverse document
frequency and discrimination value are valuable weighting factors,
and that term frequency and document length may be useful variables.

Studies of document representation have found significant
performance differences due to index language variables (Saracevic,
1968; Cleverdon and Keen, 1966; Keen, 1973). A sample of results
from document representation studies indicates that uncontrolled
vocabularies work as well as controlled vocabularies, that single
term languages are superior to other types, that there may be an
optimal depth for indexing languages and that machines and humans
are generally better at judging relevance when they are given
more text to work with (e.g. title vs. full text). In general,
studies have found indexing languages to be a variable of minor
importance (see Saracevic, 1968, pp. 119-130).

Yet Swets looked at 50 different retrieval methods over
three different systems (four different collections) and found
that there were very small differences in the performance of
different retrieval methods within a collection as opposed to the
larger performance differences between collections. These differ-
ences are attributed to the difference in "hardness" of the voca-
bularies in the subject area of the collections and to differences
in the ways relevance judgements were made (Swets, 1967, p. 28).

Studies of similarity measures have generally concluded
that there are only minor differences among their performances.
The cosine correlation has become the preferred measure (Reitsma
and Sagalyn, 1968; van Rijsbergen, 1975). Still, conclusive tests
of similarity measures for performance differences have not been
conducted. Similarity measures have been studied as measures of
association in the context of clustering items in a vector space
rather than as a query-document matching function.

One must regard all of these results with caution. Due to
the restricted ranges of other variables within which the key
variables were tested, it is unknown if observed differences
would remain consistent in different environments and if apparently
equivalent methods behave differently in non-similar settings.³
In other words, there might be interactions among variables
complicating one's ability to discern key effects.

One is skeptical of "no difference" findings. While the
variables may not have made a difference on the employed perfor-
mance measures, one might detect an effect on other dependent
variables. (See example on page 4, Figure 1.) There is likely
to be a great deal of noise present in experiments. Given the
factors that influence recall-like measures, plus human variance
in indexing and relevance assessments, it would not be surprising
to find significant differences overwhelmed by noise.

3 Saracevic's (1968) study is somewhat of an exception - a
  step in the direction of the currently proposed work. In
  his study, the variables "source of input" and "form of
  index language" were covaried. However, different term
  weighting schemes and similarity measures were not used.

It should be noted that a simulated document ranking and
cutoff procedure was used in the Cranfield II Study (Cleverdon
and Keen, 1966) to determine if that procedure would affect
the performance of index languages that were being studied.
Unfortunately, the study was not a study of ranking algorithms.
It was executed by hand, using only one ranking method, and was
based on search co-ordination levels rather than textual statis-
tical data. The data base consisted of 200 documents. A recall-
based performance measure was used which was not suited to
comparing ranking algorithms.

Sager and Lockeman (1976) defined the ranking algorithm
as composed of a query term weighting scheme, a document term
weighting scheme and a similarity measure.⁴ This conceptual
structure, as mentioned previously, is vital to the study of
the ranking process. They identified some 22 term weighting
functions for documents, 5 query term weighting schemes and 9
similarity measures, yet this list was not exhaustive. Also,
the algebraic relationships among ranking algorithm components
required further exploration. Lerman found that many similarity
measures are monotonic with respect to each other (Lerman,
1970, cited in van Rijsbergen, 1975, p. 31).

4 They also included the fourth trivial phase of placing the
  documents in descending order of similarity to the query.
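
Why monotonically related measures matter for ranking can be shown with a short Python sketch. Dice's coefficient and the Jaccard coefficient are used here as an assumed example pair (they satisfy J = D / (2 - D)); they are not claimed to be among the measures examined by Sager and Lockeman or in this study, and the documents are toy data.

```python
def dice(q, d):
    """Dice's coefficient on term sets."""
    return 2 * len(q & d) / (len(q) + len(d)) if (q or d) else 0.0

def jaccard(q, d):
    """Jaccard coefficient on term sets."""
    return len(q & d) / len(q | d) if (q or d) else 0.0

def ranking(similarity, query, docs):
    """Document ids in descending order of similarity (ties broken by id)."""
    return sorted(docs, key=lambda doc_id: (-similarity(query, docs[doc_id]), doc_id))

query = {"a", "b", "c"}
docs = {
    "d1": {"a", "b"},
    "d2": {"a", "d", "e", "f"},
    "d3": {"a", "b", "c", "d"},
    "d4": {"x"},
}

dice_order = ranking(dice, query, docs)
jaccard_order = ranking(jaccard, query, docs)
```

The two measures assign different scores but the same order, so for ranking purposes only one of a monotonically related pair needs to be tested.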



17 

Sager and Lockeman (1976, p. 17) note that "Ranking
algorithms cannot improve the results of retrieval, but only
those of display." They assume a two-step model of retrieval
(without cutoff) as described earlier. Systems in which the set
of documents given to the user consists of all documents having
a non-zero relationship with a natural language query meet the
above definition. However, some ranking algorithms define an
ordering such that all documents have some relation to the query
(e.g. distance in a multidimensional space; see Katter, 1967;
Switzer, 1965). In fact, the only situation in which ranking
algorithms exist functionally independent of the retrieval set
formation is when the set is formed by the search logic, and ranking
occurs afterward. Thus Sager and Lockeman focused on the process
of ranking the output from Boolean queries (this is not meant to
exclude other logical operators).
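The two-step model can be sketched as follows: the search logic first forms the retrieved set, and a separate scoring function then orders only that set. The tiny collection, the query and the scoring rule (a count of query terms present) are assumptions made purely for illustration; the study itself compares many scoring rules.

```python
# Hypothetical collection: document name -> set of index terms.
docs = {
    "d1": {"ranking", "boolean", "retrieval"},
    "d2": {"ranking", "indexing"},
    "d3": {"boolean", "retrieval", "evaluation", "ranking"},
    "d4": {"indexing", "evaluation"},
}


def matches(terms):
    """Step 1 (search logic): ranking AND (boolean OR evaluation)."""
    return "ranking" in terms and ("boolean" in terms or "evaluation" in terms)


def score(terms, query_terms):
    """Step 2 (ranking): number of query terms the document contains."""
    return len(terms & query_terms)


query_terms = {"ranking", "boolean", "evaluation"}

# The Boolean logic alone decides membership in the retrieved set.
retrieved = {name for name, terms in docs.items() if matches(terms)}

# Ranking only reorders the retrieved set; it never adds or removes documents.
ranked = sorted(retrieved, key=lambda name: (-score(docs[name], query_terms), name))
print(ranked)
```

Because the second step only reorders the set, it can change what the user sees first but not what was retrieved, which is the sense of the quotation above.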

In contrast to the previously mentioned support shown for
ranked output, it has been argued that there are logical fallacies
in the ranking of output from Boolean searches (Bookstein and
Cooper, 1976; Bookstein, 1977) and that ranking options have not
been utilized by users on systems which had them available (McCarn,
1976; Rickman, 1972).

The second point will be dealt with first. The ranking
methods which went unused required considerable effort on the part
of the user to manually assign weights or priorities to query
terms or to make other related judgments. It should be noted
that the methods explored in this study require no extra user
effort beyond that which would be part of a conventional Boolean
search. Also, until the SIRE ranked output study (Noreault et al.,
1977) there was little evidence that the output from Boolean
searching could be effectively ranked according to probable
relevance.

As for logical perils, Bookstein notes that any known
system of ranking Boolean search output has logical inconsistencies
due to the fact that the same query could be represented by
different Boolean statements which would result in the same ranking
method producing different document orderings in response to the
same query (conceptually the same query). Also, inconsistency may
arise from the fact that it is unclear how to deal with ANDs, ORs,
and NOTs; specifically, does satisfying different requirements of
the logic mandate different weights? What about the absence of a
word when it is NOT supposed to be present? How much should that
count?
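A small sketch shows how such an inconsistency can arise. It is not Bookstein's own example: it assumes, for illustration only, that each query term is weighted by its number of occurrences in the Boolean statement and that documents are scored by an inner product with within-document frequencies. The two statements below are logically equivalent, yet they order the same two documents differently.

```python
import re
from collections import Counter


def query_weights(statement):
    """Weight each query term by its number of occurrences in the statement.

    Terms are written in lower case; the upper-case operators are skipped.
    """
    return Counter(re.findall(r"[a-z]+", statement))


def inner_product(doc, query):
    """Sum of query weight times document frequency over the query terms."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())


def rank(docs, statement):
    """Order document names by descending inner product with the query."""
    q = query_weights(statement)
    return sorted(docs, key=lambda name: inner_product(docs[name], q), reverse=True)


# Hypothetical within-document term frequencies; both documents satisfy the logic.
docs = {
    "d1": {"ranking": 3, "boolean": 1},
    "d2": {"ranking": 1, "boolean": 2, "fuzzy": 2},
}

form_1 = "ranking AND (boolean OR fuzzy)"
form_2 = "(ranking AND boolean) OR (ranking AND fuzzy)"

order_1 = rank(docs, form_1)
order_2 = rank(docs, form_2)
print(order_1, order_2)
```

The second statement mentions "ranking" twice, so that term counts double and the document in which it is frequent moves to the top.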






These criticisms of ranking algorithms are logically
correct. However, the documents are not being ranked according
to their degree of agreement with the search statement. Documents
either do or do not meet the requirements for inclusion in the
retrieved set. The documents are then ordered along a useful
dimension - in the case of the 1977 SIRE study, by degree of
similarity to a vector composed of the terms used in the query.
Any criteria that seek to measure the degree of relation to some
aspect of the information need, or in any way predict relevance
are valid to explore and use. Bookstein correctly asserts that
to rank documents on degree of conformity to the logic of a
Boolean query is logically inconsistent. In fact, to do so
without a probabilistic or fuzzy logic designed for that purpose
would be logically incoherent.

Second, any ranking method is employed not as part of a
scientific theory of meaning or logic, but as a pragmatic tool to
aid in the satisfaction of an information need. Logical consistency
is not required of many human tasks; satisfactory performance
is required.

The Noreault et al. (1977) study referred to is an example
of an empirical examination of a ranking algorithm other than
Sager's and Lockeman's work. Noreault et al. found that a completely
automatic algorithm was able to rank the output from Boolean searches
effectively on probable relevance with no extra user effort and
little incremental system cost. In that study the environment of
the ranking algorithm was characterized by Boolean queries created
by an intermediary and a stemmed free text vocabulary from titles
and abstracts with 150 common words removed. The ranking algorithm
consisted of query terms weighted by their number of occurrences
in the query, document terms weighted by their frequency of
occurrence in the title and abstract, and the cosine correlation
as the similarity measure.
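The ranking algorithm just described can be sketched in a few lines. This is a minimal illustration, not the SIRE implementation: it assumes the documents have already been retrieved by the Boolean query and that stemming and common-word removal have already been applied to the token lists. The names `cosine` and `rank_retrieved` and the sample data are hypothetical.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine correlation between two sparse term-weight vectors."""
    dot = sum(weight * y.get(term, 0) for term, weight in x.items())
    norm = math.sqrt(sum(w * w for w in x.values())) * math.sqrt(
        sum(w * w for w in y.values())
    )
    return dot / norm if norm else 0.0


def rank_retrieved(query_terms, retrieved):
    """Rank already-retrieved documents against the query.

    Query terms are weighted by their number of occurrences in the query,
    document terms by their frequency in the title and abstract tokens.
    """
    q = Counter(query_terms)
    scores = {name: cosine(Counter(tokens), q) for name, tokens in retrieved.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical retrieved set: document name -> title and abstract tokens.
retrieved = {
    "d1": ["ranking", "ranking", "boolean", "output"],
    "d2": ["ranking", "cost", "cost", "cost"],
}
order, scores = rank_retrieved(["ranking", "boolean"], retrieved)
print(order)
```

Nothing in the sketch asks the user for weights: every quantity comes from the query statement or from the stored text, which is the property the study emphasizes.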


APPROACH AND METHODOLOGY 
The approach taken here embodies a philosophy towards the
study of document retrieval systems. The study of information
retrieval is in its infancy. There are fundamental aspects of
computerized document retrieval systems about which little is
known. Studies of overall system performance and user satisfaction
are, of course, valuable. But similar emphasis needs to be placed
on the functioning of various system components.

An emphasis on isolating and testing specific system
components does not dictate studying each individual component in
an isolated environment. One of the important aspects of this
research design was the plan to isolate, control and vary the
levels of several system component variables at the same time so
that main effects and interactions could be studied.

Sager and Lockeman's three component model was expanded
for this study to include two environmental classes of variables
(Figure 1). Just as it was important to vary the levels at which
ranking algorithm component variables combine so that a statement
could be made about relative ranking algorithm performances
within the particular environment in which they were tested, it
was important to test the ranking algorithm combinations in
various environments.

For example, in an environment in which all document
representations are the same length, it makes little sense to
employ a term weighting scheme or similarity measure that normalizes
by the length of the document representation. Frequency with which
a term occurs in a document representation is likewise not a
meaningful variable in a vocabulary of subject index terms or
classification codes assigned by indexers.

Unfortunately, there are too many variables within the
five variable classes in the model and too many levels of all of
these variables to encompass in a single comprehensive test.
Further, this is a study of the ranking process, not the entire
retrieval process. So, some environmental factors have been
simplified for the study.

RESTRICTIONS 

The query form used in the study was the Boolean query.
There are several reasons for this. 1) The study was designed
to impact system designers working with the current state-of-the-art.
The vast majority of operational systems today provide for
Boolean searching. Thus, in terms of query form, our results
should be generalizable to that population. 2) The study measured
the effect of ranking algorithms on already formed sets. As
mentioned previously, the natural language systems used ranking
algorithms to define these sets. 3) Methodologically, the
measurement of the effectiveness of competing ranking algorithms
becomes difficult if natural language queries are included. Natural
language queries may retrieve sets of documents so large that a
cutoff must be used to restrict the number of documents the user
has to examine. This poses problems for comparison of retrieved
sets. Also, no query weighting schemes which required user
assigned weights were used.




The study was performed on a commercially available data
base, Current Index to Journals in Education. Document representation
forms were selected from those existing on the data base.

There is always a question about the generalization of
results obtained from experimentation on a single data base.
Cooper (1970) warns information retrieval researchers about the
excessive number of variables to be considered. Swets' (1967)
observation that the performance differences between collections
were far greater than those between different retrieval methods
within a collection was noted earlier. One expects that there
will be no dramatic changes in different collections in the ranking
algorithms found most effective. The effects are likely to be
attributable to factors reflected in the document representation
variable. However, replications in different collections would
lend credence to the stability of the results.

METHODOLOGICAL REQUIREMENTS

The nature of the dependent variables (performance
measures) used to test the effects of the ranking algorithms is
an important consideration. Any measures used must specifically
measure the ranking algorithms' effect and not reflect other
aspects of the system. The measure should test only the change
in ordering due to the ranking algorithms. For ease of understanding,
it is also desirable that the measure be a single number rather
than a curve. The Coefficient of Ranking Effectiveness (CRE) as
described in Noreault (1977) was designed for this purpose.




An essential factor in this study was the Syracuse
Information Retrieval Experiment (SIRE). Its augmented
inverted file design (see McGill et al., 1976) allowed the
two-step processing of queries, using a variety of ranking
algorithms (QWs, TWs, and SMs). STAIRS has a comparable
capability but is less flexible in this regard. Sager and
Lockeman used STAIRS and were unable to vary QW or SM or use
seven of the 22 TWs they described (Sager and Lockeman, 1976,
p. 18).


OBJECTIVES OF THE STUDY 


1) To assess the relative effectiveness of alternative
   methods of ranking the output from Boolean queries
   (that is, alternative methods of predicting the relative
   relevance of retrieved documents). Specifically,

   a) To assess the effectiveness of various term
      weighting schemes (within and across DRs);

   b) To assess the effectiveness of various
      similarity measures (within and across DRs).

2) To determine the relative costs of implementing and using
   each ranking algorithm (component).

3) To determine specific file modifications necessary for
   conventional inverted-file systems to implement particular
   ranking algorithms.



PROCEDURE 

Briefly, the procedures followed in this study were:

1) To secure the use of a suitable data base. The specifics
   of the data base will be described later in this report.

2) To obtain the cooperation of a suitable user population. One
   hundred seventy three interest statements were acquired.

3) Review the term weighting schemes and select a representative
   group. Twenty one were finally selected.

4) Review the similarity measures both algebraically and by
   simulation to select a representative group. Twenty four
   were eventually chosen for inclusion in the study.

5) The statistical properties of the text were calculated to
   produce the weighting factors and similarity measures.

6) Characteristics of the data base were identified so that
   cost data could be acquired.

7) Programs were modified as necessary.

8) The data base was loaded and preliminary data was collected.

9) Intermediaries were trained to use the system. The
   intermediaries were kept blind to the system's ranking abilities.

10) Interest statements were obtained from users, and assigned
    to intermediaries.

11) The intermediaries translated the interest statements into
    Boolean queries. The interest statements were translated
    into the appropriate document representation.

12) Documents retrieved were merged and placed into a randomized
    order. This list was given to the user for relevance
    judgements.

13) Similarity values were computed between the query and the
    document. Documents were then rank ordered.

14) Ranking effectiveness scores were then calculated.

15) Cost data for the ranking algorithms were assembled.

16) Differences and patterns in the data were searched out.

17) Conclusions were drawn and appropriate post-hoc comparisons
    were performed.

REVIEW AND SELECTION OF TWs

This section reports on a review of the TWs found in
IR literature and on the selection of a sample of TWs for
inclusion in the experiment phase of the project. The experiment
calls for the crossing of TW and SM in the environments of both
DRs described above.

Since Luhn (1957) suggested that a term's frequency of
occurrence in a document might be of value, in addition to its
occurrence, about forty different TWs have been described in IR
literature. The previously most comprehensive list of TWs was
provided by Sager and Lockemann (1976).

Studies dealing with term weighting have had a variety 
of purposes, including recall or precision enhancement, selecting 
"good" index terms, term clustering, and ranking effectiveness. 
The TWs in these studies (see TW Bibliography) form the population
from which the current work samples. 

Certain types of TWs were excluded from consideration.
These include manual weighting (e.g. Maron and Kuhns, 1960),
term classification schemes (e.g. Sparck Jones and Jackson,
1970), relevance weighting (e.g. Robertson, 1974), use of
co-occurrence data (e.g. van Rijsbergen, 1977), and methods requiring
complex estimation of distribution parameters (e.g. Harter, 1975).
The first three types are not reasonably applicable to
state-of-the-art automatic IR systems. The latter two have potential
application, but are excluded from the present study because of
their complexity and the effort required to implement and execute
them on an operational system.

Even with the restrictions above, over thirty unique TWs
were found. These measures differ on three basic dimensions:

1) The use of frequency information as opposed to binary
   (presence/absence) information about term occurrences.
   TWs using frequency information are labelled "F" on Table
   1 below.

2) Consideration of document length (labelled "D").

3) Consideration of collection frequency (labelled "C").
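The three dimensions can be made concrete with a short sketch that computes one weight of each kind for a single term in a single document. The symbols follow the legend of Table 1 (f_in, t_n, k_n, d_i, N). The specific forms chosen here (f_in for F, 1/t_n for D, f_in/k_n for FD, log(N/d_i) for C) are simple, commonly cited forms assumed for illustration; they are not asserted to be the TWs sampled for the experiment, and the three-document collection is hypothetical.

```python
import math
from collections import Counter

# Hypothetical collection: document id -> list of term occurrences (tokens).
collection = {
    "n1": ["ranking", "ranking", "boolean", "query"],
    "n2": ["boolean", "query", "query", "cost", "index", "index"],
    "n3": ["ranking", "cost"],
}

N = len(collection)  # number of documents in the data base
# d_i: number of postings of term i (documents that contain the term).
postings = Counter(t for tokens in collection.values() for t in set(tokens))


def term_weights(term, doc_id):
    """Return one illustrative weight per dimension, or None if the term is absent."""
    tokens = collection[doc_id]
    f_in = tokens.count(term)   # frequency of term i in document n
    if f_in == 0:
        return None
    t_n = len(set(tokens))      # number of types in document n
    k_n = len(tokens)           # number of tokens in document n
    d_i = postings[term]
    return {
        "binary": 1,                # presence/absence only
        "F": f_in,                  # within-document frequency
        "D": 1 / t_n,               # document length
        "FD": f_in / k_n,           # frequency normalized by document length
        "C": math.log(N / d_i),     # collection frequency
    }


w = term_weights("ranking", "n1")
print(w)
```

Labels such as FC or FDC in Table 1 denote weights that combine these ingredients, for example a frequency term multiplied or divided by a collection term.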

Table 1 contains a list of the major TWs considered. They
vary as described above, as well as in the measures used to
represent the component terms, operators connecting the terms,
and scaling factors. The TWs in the literature have been based
on theoretical grounds; the terms of the measure are related in
order to represent theoretically defined relations. Yet the matrix
of possible permutations of the terms is rather well filled in.
Thus it seemed appropriate to define some new TWs to fill gaps
in the matrix. Also, some obvious simplifications are
suggested.

Table 1 includes a reference for each TW, where
appropriate, and the TW's form on the three dimensions. Other
comments about each TW are reported, including reasons for the
TW's inclusion (denoted by an *) or exclusion from the sample
for experimentation. The sample was designed to allow
generalizability to the population of TWs identified.

In addition to the specific results mentioned in Table 1,
some general tendencies have been noted. Collection frequency
has been successful, while the effects of within document
frequency and document length have been ambiguous (Sparck
Jones, 1973). Both TW specific and TW class differences
were examined in the present study.






TERM WEIGHTINGS

TABLE 1

$f_{in}$ = frequency of term i in document n.
$t_n$ = number of types (unique terms) in document n.
$d_i$ = number of postings of term i.
$k_n$ = number of tokens (term occurrences) in document n.
$F_i$ = frequency of term i in data base ($= \sum_{n=1}^{N} f_{in}$).
$N$ = number of documents in data base.
$M$ = number of terms in dictionary.
$D$ = number of postings in data base ($= \sum_{i=1}^{M} d_i$).
$K$ = number of tokens in data base ($= \sum_{n=1}^{N} \sum_{i=1}^{M} f_{in}$).



FORMUU (W^ -) 
X) 1 



2) ^ 



3) 



log t. 



REFEI^CE 
Sager (1^76) 



Sagar (1976) 



10 t 

^ ™*^V t Bates (W77) 



5) 2- 



MAX (t ) 



COMMENTS 

"Unweighted." Simplest 
and most coooBonly ^aed 
method. 



Simplest conaidcca- 
tlon of document leqgth, 

D. Obvious transfon^^- 
tion to diminish effect 
of long documents. 



0. Integer formula. 
Where TOT CX) -next 
hi9host multipla of 
{O above X. 

D. Non^iatsgar varaion 
of #4. 




REFERENCE 



COMHEKfS 



• 6) £ 



in 



* 7) l0g £ 



• 8) 



* 9) 



XI) 



in 



in 



log k 



S«S«r (1976) 



Sparck Jontm 
(1973) 



Sager (197#) 



10 k 



XO) £^p(XO-tntpt (^ Hi«(k " S> ) ^P^'^ck Jones & 

^ V . (1977) 



^n«2 - 



k^ 

MAX(0 
n 



F. Sinplesc conaidar- 
at ion of within 
document frequency 

F. Diminishes effect 
of nany occurreneea 
in a docuaent 



FD. Simplest consider- 
ation of within docunent 
frequency and doeumi^nt 
X«ngth (uMing fraqurncy 
inforaation). 



FD. Like ^8. but dInlnlHhe 
inpact of docustcnt length. 



FD» Saaie as #4 but using 
frequency inferaation . 



FD, Non-integar vara ion 
of #10. 



•12) -f. 
^'i 



Saagar (1976) 



C. SinpXcat consideration 
of collection frequfney. 



X3) 



Xog d, 



C. Sane as #12 byt 
diminishes inpact of 
high frequenciea. 



•lA) log(^) 



Saeger/(1976) 
h RobertaoQ 
(1974) 



C. Based .on the inforna- 
t ion content of a term 
about a docunent. Salton, 
Wong & Yang (1974) inter- 

prat thia aa log(#>) i. 

*1 






FORKUIA (V^ •) 

111 

15) log I 



16) GI09 



log ^ 






CB 



Sparck Jo(t«0 
. .(1974) 



Sparck Jones 
& 'Baees (1976) 



C(»fMENTS 



C. Percent of postings 
belonging Co a csni. 



C. Integer formula 
for #14. Where G(X)»H, 
where 2*"'<x ^2"*, ' 



FC. 1X5 vlth fraquttficy 
information. • * ^ 



18) 



in 



P. 



Sparck Jones 
(1973) 



FC. #12 with frsquspcy 
information. 



^in T 
1 



Ssger (1976) 



FC. Available on IBM's 
STAIRS.. Increases with 
higher ratio of occur- 
rences per posting. 



20) 



in 



Ssger (1976) 



FC. Mixes levels, fre- 
quency information in 
numerator, binary in 
denominator. 



21) f* . 4 
in .2 

^1 



Ssgsr (1976) 



FC. Like #X9 but more 
scnflitivc to within- 
document fret^usncy and 
slso sensitlvs to high 
postings. 







n -1 



* 27) - — ^ 






FORMUU (W^^ .) REFEREKCB 



COMMENTS 



* ^'^^^ ^^•^^^ Opposite effect of 

^ <'19. Increases with 

fewer occurrences por 
r posting. 

' **• 4][ . PC. with frequcncys 

infornation, but mixed 
levels. Sal ton, Vong & 

Yang (1974) use i^^'logi^Hl. 

ml - 

* ♦ 

* 24) ^f^ • losf*^) — 

In ' PC. 0X7 fully weighted, 

or less faa without 
Bixed levels. 

» 

• f 

•'25 — 



^1 : J'C. Like #18 but dlQlnlshes 

impact of collection 



f rectuency . 



• 26^ — 1 

' t d, Sparck Jone« 



/iOT'»\ combination. 
^*^^'^f Simple cohaideratioQ of 

document length and 
collection frequency. 



d^) " DC. Like &26 but 

diminishes effects of 
document length and 
collection frequency 



28) * . ___ 

'n — mentioned any 

where, but la sisplar 
thMi #26 or #27. 






FOlUiUU (W^ •) 
in 



REFERENCE 



C(»1M£NTS 



1 **1 

' t D 
n 



Sagcr (1976) 



DC. Differences between 
term's role In docua^nt 
and in the collection. 



30). 



Seger (1976) 



DC. Reduces to 



which is a linear 
transformation of #4(6. 



31) -S^ 



Sager (1976) 



DC. Poisson Standard 
deviate (#37) converted 
for binary data. 



32) 



In 

k F, 
n i 



Sparck Jonas 



FDC. 126 with frequency 
infomation. 



33) 



in 



108(k„F^) 



FDC. #27 with frequency 
infonaation. 



34) 



in. 



k F 
n i 



FDC. #28 with frequency 
informacion 



^in '^i 

35) -j^ - -j- Edmundson & FDC. #29 with frequency 

-A Wyllys (1961) inforaation^ Found 

effective in their atudy. 






FORMULA (V^ «) 



36)- 



in 




REFERENCE 



Edaundson & 
Vyilys (1961) 



COMMENTS 



FDC. #30 with frequency 
infonoacion'. Found 
effective In their study. 
X« linear transforMtlon 
of #32. 



37) 



Ednundeon & 
Wyily. (1961) 



FDC. Polsson Standard 
Deviate. Xs #35 with 
difference standardicad 
by an estlnate of 
standard davfatloa. 



38) 



In 



n 



Edffiundson & 
Wyllys (1961) 



FDC* Found inefftcelvc 
in their study. 



39) log- 



In 



n 



K 



Edmundson & 
Wyllys (1961) 



FDC. Found ineffective 
in their atudy. 




REVIEW AND SELECTION OF SMs 

SMs that potentially could be used to rank documents
which have been retrieved by a Boolean query include any function
which assigns a number to a pair of vectors based on their
similarity. Selecting a representative sample of SMs presents
more difficulties than does the selection of TWs. The main
reason for this is that very few of the SMs advocated for use
in IR have been created for the purpose of measuring the similarity
of documents to queries, and very few actually have been used to
rank order documents for output.

The SMs reviewed here are those that have been proposed or
used for some IR activity, or are closely related to such measures
in form or by reference. Many of the SMs come from the field of
Numerical Taxonomy. Of these, some have been used for various
purposes in IR, such as computing the similarities between terms
or between documents for clustering.

Before selecting SMs for experimentation, it was useful
to assemble a list of potential SMs. Of course, this list is
not exhaustive, since the SMs come from such a diversity of areas.
It was meant to be comprehensive in terms of those SMs mentioned
in the context of IR, with certain restrictions. Certain types
of SMs were excluded, namely those which are explicitly for
measuring the similarity between groups (e.g. clusters of
documents); measures requiring the changing of the nature of the
attribute space (e.g. measures dependent on a factor analysis);
measures that place each item in a category rather than assign
a score to each item; and iterative methods. This is not a
major constraint since it leaves within our domain a rich
population of SMs to which we can generalize.

A list of sixty-seven SMs is in Table 2. SMs marked by
a + or an @ were used in the clustering analyses described
below. SMs marked by an * were selected for the main experiment.



B = denotes binary measure
+ = used in simulation 1
@ = used in simulation 2
* = used in main experiment
$X'_i$ = 0 if $X_i$ > 0, 1 otherwise
$Y'_i$ = 0 if $Y_i$ > 0, 1 otherwise


FORMULA 


1 



SIMILARITY MEASURES

TABLE 2



I) .S_ - 



EX Y 
11 



2) S • EX,Y^. 
xy 11 



*9f 4) S « 
xy 



E(Xj-X)(Y^-Y) 
E(X^-X)^ • E(Y^-Y)^ 



B S) S 



ad - be 



^' (a+c)<c+d)(b+d) 






$X_i$ = weight of term i in the document (X)
$Y_i$ = weight of term i in the query (Y)
$M$ = number of terms in dictionary
$S_{xy}$ denotes similarity measure
$D_{xy}$ denotes dissimilarity measure
REFERENCE COMMENTS 



Torgerspn (1958) Cosine. 



Overall and 
Klett (1972; 



Vector or inner product 



Overall and 
Klett (1972) 



Mean Cross Product. 
Monotonic with Q2. 



\ 



Sneath and 
Sokal (197^) 



Pearson Product Moiaent 
Correlation. 



rneath and 
Sokal (1973) 



Correlation for Binary 
Data. Equivalent to #4 



fORHULA 



•^B 12) S - ^4^-1^ 
. >y ad + be 



(^-ex,y^.ix;y;*:xjy^.ix^y; / 



B 13) S - *^ " 



♦B 14) s - £!±±:£ 
«y M 



REFERENCE 



COMMEKTS 



Jones & Cure if 
(1967) 

Karon & Kuhns 
(I960) 



Karca aKi Rutins'. 



Sneath and 
Sokal (1973) 



Tule*s. Numerator Is 
det^nninanc of £ * 2 
■atrix. \ 



J 



Tague (1966) 



Ponsula for converting 
binomial variable to 
standard normal form. 
E<|uivalent to #1. 



Sneath and 
Sokal (1973) 



Baoann*s. Found monotonic 
with 17 in binary 
simulation. 



00 



IT 



FORMULA 

_ 2a ^ 



15) $ 



(■. 



4B 16) S . r-4- 
«y b + c 



-fB 17) S ! 



1 "l^l gX^Y^ 






18 



REFEREN'CE 

Snoath and 
Sqkai (1973) 

* 

Leraan (1970) 

teraan (1970) 

r 

Uroan (1970) 



COMMENTS 
Dice's SM 



Kulzynskl's. Found 
■onotonlc vlch ^IS in 
binary siaulatlpn. • 



Sofcal and Sneath's. 
Found Bonoconic with #15 
in binary simulation. 



Kulzynskl's. Arithmetic 
sean of shared percentage 
of X and Y. 



i 




fORMlXA 



i 19) S 



( 



/ex? . IT? 



1 20) D - V 

ly 2a b -I- c 



A 21) S • ~ 
«y K 



22«) S 



(• 






REFEREKCE 



Uraan (1970) Ceosetrlc oean. Modified 

correlation coefficienc. 
Equivalent to #1. 



van Rijsbergen Distance conversion of 
W975) Dice's SM. (115). 

• Monotonic vith #15. 



I^rman (1970) Russell and Rao*s. 

Equivalent to #2. 



Jones and Recall of Y for X. If 

Curtis (1967) |«4c| < |a-l-b| then it 

is equWalent to #27. 
For binary data is 
Bonoconic with #1. 



fORMLTA 



♦B 22b) S - — ^ 



( "i^ 1 



ay a ♦ b' c' 



B 24) S • 



Ma 



«y (a + b)(a + c) 



■ ^5) - - log , ^ J},^ ^ , 

xy * (a + b)(a ♦ c) 



26) S - 7 —J 

' Ky (a ♦ b)(a + c) 






RCFERiESiCE 



COMMENTS 



Jones ani 
Cdrcis (1967) 



Recall of X for Y. 
If 4iCH:| > |a+b| 

then It la equivalent 
to #27. For binary data 
is Bonotonic wltll #2. 



Jones and 
Curtis (1967) 



General fom. 

/ 



Ball (1965) 



Koche^and Wong. 
Equivalent to #1. 



Ball (1965) 



Abraham's. Equivalent 
toll. 



Jon^s and 
Xurtis (1967) 






fORNUIA 



•i 27) S - > L„ 

xy Bin <;x\ :y ) 



1^. - ? ' 

28) s^-n.-A-^ 



nin <X,. YJ 

W (X,. Y.)' 
•§f 29) S - ^ ^ 

«y N 



^ V 2 < M 2^2 > 



2Mx^c.w. - S(X, - Y,)^ 
• §f 31) S - i 

2M¥ - ^fX - Y )^ 






REFERENCE 



COMMENTS 



Sager and 
Lockexaann (1967) 



Overlap. If |a+b| > |«-fc| 
Chen ic is equivalent Co 
#22a. If [a+bj < |a+cj 
then ic is equivalent 
to 22b. 



Sneath and 
Sokal (1973) 



(rover. • che maximum 
value of term i, in any 
document . 



Reitssa and 
Sagalyn (1968) 



N * number of shared 
cerms. For binary data 
Is equivalent to #2. 



Sager (1976) 



Bennet and Spiegel. 

N ■ number of documents 

in collection. 



Sneath and s CatCell*s Patcem. 

Sokal (1973) Similar it v. Found 

vich #7 io binary simulation* 




FORKUIA 

0 



tX Y 

♦B 33) S - * ^ 



B 34) S 



icy a "f b + c 



B 35) S - ^ 
My • ♦ b 4- c 



EX Y 

36) S • * * 



IX? ♦ lY? - EX.Y. 
i 1 i 1^ 



I 

* 






REFERENCE 



COMMENTS 



Reitsma and 
Sagalyn (1968) 



Average weight of shared 
terms . 

B • 1 if Xj > 0 and Yj > 0, 

* 0 otherwise. 
M ■ number of shared terms* 
for binary vectors is 
equivalent to #2. 



Reitsaia and 
Sagalyn U968) 



Parker-Rhodes Needh^a. 
Found monotcnic with ^15 
for binary simulation. 
For binary data, is 
equivalent to #34. 



Sneath and 
Sokal (1973) 



Jaccard's. Intersection 
divided by L*nir>JQ, For 
binary data is monotcnic 
with #15. 



Tague (1966) 



Doyle* 8. Equal to #34. 



Sager and 
Lockemann (1976) 



Taniffloto's. For binary 
data is equivalent to 
#33 and #34. 



FORMULA ^ 



4- 37) S * * 



jty N 



N - N , N 



♦B 38) S ^ " 



M-maxflX.Y. ♦•:x;y,. JX.Y. + IX.Y") ■ 

11 11 1»1 .11^ 



♦B 39) S - ^<"' : 



»y „2 



/ 2<EXjYj • txjYj - :xjYj • rXj^Yj) ,\ 



*e*B 40) S • 2(ad - bc> 
^ xy M(2 a b + c) 



( 



2(£XJ^ • rXjYj - rXjY^ • EXjYj) \ 
. M(2EX^Y^ 4 ;x;Y^ * EX^YJ) ^' 






REFERENCE 



C»fHEKTS 



Reitsma and 
Sagalyn (1968) 



Interpret 3t ion of #33 
for urelghted vectors, 
Monotonic with 115. 



Kuhns (1965) 



Rectangu lar. d lstance_ 

aibove independence. 
Found Qoneconic with 91 
in binary simulation. 



Kuhns (1965) 



Separajcion above 
independence. Monotonic 
vith #11. 



Kuhns (1965) 



Coef f ic lent of^the 
Artihsie tic Mean 






rOKMULA 



lEFEREKCE 



COMMENTS 



EX Y • ^X*Y* 



- rx!Y 



i i 



-^B 42) S • 



ad - be 

M • mln(a+b - ^ . iS^) 



«1^1 • 



j:x;y! - rxiY 

A 1 



11 



EX.Y 



1*1 



ax.Y, * :x!Yj'' 

M • Bln(IX.Y, + EX!Y^ - i-5 L±_ rv v 

1 i 1 1 K • i.*^T^ 



Ktihos (1965) 



Kuhns (196S) 



Angle between vectors 
above Independence. For 
binary data la aonotonlc 
with II. 



Probability dli.^rence 2 
above Independence. Found 
■onotonic with #11 In 
binary simulation. 



♦B 43) S - 



ad - be 



M • aaxCa-f-b - 



i^,a.c-I^) 



^Vi ' ^x^Y^ ^rx^Y, > ex.y; 



Kuhns (1965) 



„ (EX.Y, ♦ rX!Y,)^ 

11 • iaiax(rx 4 EXJY^ - - \ ^ ^ ^ * 



(EXJ^ ♦ EXjY*)^ 



Probability difference 1 
above dependience. 
Found nonotonlc with #7 
la binary simulation. 



) 






rOHMLTA 



ii£FEREN'C£ 



Kuhns (I96S) 



CGMMENTS 

Tule's coefficient of 
colligation. Found 
▼ery similar to ^12 in 
both sinulations. 



♦B 45) S 



ad - be 



xy M . minCa + b, ,a ■♦• c) 



7 

M-oinCX^Y^ + :Xp-^, EX^Y^ + ' Vi V 



m 46) S 



ad - be 



Hil - 



a 



(a+b)(a+c) 



(2a+b+c - 



(a+b) 



Kuhns (1965) 



Kuhns (1965) 



Conditional probability 
above independence. 
Found monotonic with #11 
in binary simulation. 



Proportion of overlap 
above independence; 



EX^Y^ • EXJYJ - EXjYj • rX^Yj 



jj-»-:xjYj)(:XjY^ + ix^Yj) ; 



M(l - 



rXjY^ 



(EXjYj -f ix;y^)(ix^y^ + :x^Yp> ^^'h^i * ^ -h^l 



(£X 



) / 
/ 




FORMUIA 



B 47) S ■ , 

xy (a-»-b){a+c) 



(- 



ix,T^ «'YI rx;T. ■ tx,!?; '^ 



+8 48) S -MlnM^ alna blnb cinc + 

dlnd - (tt+b)ln (a+b) - (a-h:)ln (a-K:) 
<b4tl)ln ih+d) - Cc+dUn (c-Hl) 



4B 49) S • CMlnM 4 alna + blnb + cine + 

dlnd - (a+b) In (a+b) (a+c)ln (a+c) - 
<b+d)ln (b-Ki) - (c+d)ln (c-HI)j/CMlnM 
alna - blnb - cine - dlndj 

♦fB 50) S * y/l ~ (1 > KXt.V))2 




REFERENCE C(»M£NTS 



Kuhns (196S) 



^44 



Index of Independence. 
For binary data is 
■onotonic with #1, 



Sneath and 
Sokal (1973) 



KuCual Infort)»tion of 
X and y. I (X;Y^ equals 
I (X) + I (Y) - 1 (X.Y) 
where I (X) - inforaation 
oajX, I (X,Y) - joint ' 
infomacion on X and Y. 
Not readily applicable 
to weighted vectors. 



Orlocci (1969) 



Ratio of mutual to Joint 
information. Equals 
I(X;Y)/I(X.Y). Not 
readily applicable to 
weighted vectors. 



Orlocci (1967) Rajski's Coherence 

Coefficient. Not 
readily applicable to 
weighted vectors. 



FORMULA 



N n m ■ 

51) S . ^ 1-1 



N n « M - 

k-l i-1 k-1 




*ef 52) S 



xy 



100 i^5(Xj.Yj) 
log P(i) 



10 EX^Yj 
log P(i) 



log P(i) • 





60 



CQHMEMTS 



. Where: !f • number of 
documents In collection, 
n " number of cerrs in 
document (X). a - ouzber 
of terns ia 0«ery (Y). 
Ijl^ ■ frequency of docu- 
ment (X)*s 1th tern in 
document K. Angle between 
Average term of dccurent 
and average term of query 
over space defined by 
documents. Requires ex- 
cessive computation. 

Mbere 

B • X If > 0 < Yj 

0 otherwise. P(i) - 
number of postings of term 
i. For binary data is 
equivalent to #16. 






FOf MUIA 



EEFERENCC 



♦B 53) 



M(!Ma • <a+b)(a-h:)' - |) 
*«y " («*b)(a*c)(b+d)U"Ki) 



(■ 



M.2 



(rx + ::xj Yj) cx + zx jj) (zxjy^ + ixjYp (ix^Yj; ^ ixjy • )j 



Jones and 
Curtis (1967) 
Keitscia and 
Saga?vn (1968) 



# 54) S 



4M(| 



NEX^ 
144 



lY 



144 



rxj iyJ Ixf 

<14l> <14l> - 14l> - ut> 



rY 



Reitsoa and 
Sagalyn (1968) 



Stile*s. For binary 
daCa is equivalent to 
15 and #4. 



Their interpretation of 
#53 for weighted vectors. 



« 55) - /I(X^ - Yj)-^ 






Sncath and 
Sokal (1973) 



Euclidean distance. 
Hinkovski 2. Equivalent 



to rxJ-»-rYj-2/Exj'2:xt • cos 

11 i i xy 

and rxj + iyJ 2j:x^y^. 

Found oonotbnic with #31« 
in both siaulations. 



FORMUIA 



57) D^-i nx^-yJ 

58) - 1 E(X^ - Y^) 

VkX, - Y,)^ 

59) D - y ^-^ 1^ 

xy M 



6X) 0^ - [S{Xj - Yj)'j*/'' 







REFERENCE 



CaMMENTS 



Sneaeh and 
Sokdl (1973) 



City Block distance. - 
Hinkowski 1. For 
bleary data Is equiva- 
lent to 155. 



Sneath and 
Sokal (1973) 



Kean Character Difference., 
rquivalent to #56 c 



Sneath and 
Sokal (1973) 



Average Distance. 



Sneath and 
Sokal tl973) 

Cormack (i971) 



Sneath and 
Sokal (1973) 



Euclidean Distance Average, 
Equivalent to #55. 

\ 

General Euclidean Distance 
*FonB. V can equal l,.or 



1 ' 



for all J. 

General Minkowski Form. 






FORMULA 

e2) s - — = i L_ 



|X - T « 
•e*- 63) D - E(-7-i 11) 



X - y 

*§f 64) 0 - 1 E(Ji i)2 



65 D 



b + c 



xy 2a -f b -t- c 



♦B 66) S - b - c 

xy a-f-d-f-b-t-c 



67) S • ♦ 

xy « 2b -f 2c <f d 



REFERENCE 



CCMMESTS 



Consack (1971) 



Coefficient of Nearness. 
Equivalent to #55. 
Found monotonic vith #7 
in binary simulation. 



Sneath and 
Sokal (1973) 



Canberra Distance 

For binary is equivalent 

to #7. " 



Sneath and 
Sokal (1973) 



Coefficient of Divergence. 
For binary data 
is monotonic with #7 
and #63. 



Sneath and 
Sokal (1973) 



Nonmetric Coefficient. 
Equal to #20. For 
binary data is equiva- 
lent to #20, #15. 



Lenoan (1970) 



Hamann's. Equal to #14. 



Sokal (1973). Monotonic with #9 
and #7. 






SMs marked by *B are designed for use on binary vectors. 
Binary measures are described in terms of the two-by-two table 
shown below (Table 3). 





                          Query Y 
                          1      0 

    Document X     1      a      b 
                   0      c      d 

where 

    $a = \sum_{i=1}^{M} X_i Y_i$ 
    $b = \sum_{i=1}^{M} X_i Y'_i$ 
    $c = \sum_{i=1}^{M} X'_i Y_i$ 
    $d = \sum_{i=1}^{M} X'_i Y'_i$ 

    $X_i$  = 1 if Doc X contains term i, 0 otherwise 
    $X'_i$ = 1 if $X_i = 0$, 0 if $X_i = 1$ 
    $Y_i$  = 1 if Query Y contains term i, 0 otherwise 
    $Y'_i$ = 1 if $Y_i = 0$, 0 if $Y_i = 1$ 
    M      = number of terms in index 




Below many of the binary measures appears a generalized 
version intended for use on weighted vectors. The translation is 
not always obvious (cf. Reitsma and Sagalyn, 1968). In each case 
the weighted version was constructed so that the binary measure is 
a special case of the weighted version and so that the characteristics 
of similarity being measured are preserved as much as possible. 
The first point means that when applied to binary vectors, the 
binary SM and its generalized counterpart produce equivalent 
values. The second point refers to the intended behavior of the 
SM. For example, a + b + c + d could be interpreted as a constant, 
M, or as an additive function of the vector lengths, |x| 
and |y|. In such cases, an attempt was made to preserve the 
intention of the binary SM. 
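
The two-by-two counts can be illustrated with a short sketch. This is only an illustration, not code from the study: the function name `contingency_counts` is hypothetical, and it assumes that b counts terms present in the document but absent from the query (and c the reverse).

```python
from typing import Sequence, Tuple


def contingency_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for a binary document vector x and query vector y.

    Assumed convention (not confirmed by the source):
      a = terms in both, b = terms in document only,
      c = terms in query only, d = terms in neither.
    """
    if len(x) != len(y):
        raise ValueError("vectors must be defined over the same M index terms")
    a = sum(xi * yi for xi, yi in zip(x, y))
    b = sum(xi * (1 - yi) for xi, yi in zip(x, y))
    c = sum((1 - xi) * yi for xi, yi in zip(x, y))
    d = sum((1 - xi) * (1 - yi) for xi, yi in zip(x, y))
    return a, b, c, d


doc = [1, 1, 0, 1, 0, 0]    # illustrative document vector, M = 6
query = [1, 0, 0, 1, 1, 0]  # illustrative query vector
a, b, c, d = contingency_counts(doc, query)

# The four cells always sum to M, and the row and column sums
# give the binary vector lengths |x| and |y|.
assert a + b + c + d == len(doc)
assert a + b == sum(doc)
assert a + c == sum(query)
```

Under this convention the sum a + b + c + d is fixed at M for binary data, which is why a weighted generalization has to decide whether that sum should remain a constant or follow the vector lengths.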


A reference is also given for each SM. This reference 
is either to the measure's introduction or to a discussion or 
analysis of the measure. Preference was given to more accessible 
sources. 

The SMs listed in Table 2 may be divided into four types 
following the typology of Sneath and Sokal (1973). "Association 
coefficients" work with qualitative data and measure some 
variant of the amount of agreement between the two items. The 
binary SMs discussed above fall under this heading. 

"Correlation coefficients" cover such measures as the 
Pearson Product Moment Correlation and the cosine correlation. 
Such SMs measure proportionality and departure from independence. 

"Distance measures" measure dissimilarity, 
smaller values indicating greater similarity. The notion of 
distance implies a space in which distance is measured. The 
distance measures described here do not necessarily obey the 
rules of Euclidean spaces. Distance SMs are denoted by a "Dxy" 
instead of an "Sxy" in Table 2. 

"Probability coefficients" treat the likelihood of 
agreements as well as the presence or amount of agreement as 
considered by the other measures. Such SMs include information 
theoretic values. 

After compiling this list, it was desirable to narrow it 
down to a group of approximately twenty for the main experiment. 




Comparison by inspection and simple algebraic manipulation 
enabled some reduction of the list. Some formulas were found 
to be identical, having been described in the literature by 
different names or different terminology (see, for example, No. 34 and 
No. 35). Other pairs were found to have joint monotonicity 
(i.e. the rankings they produce are identical). SMs No. 3 and 
No. 2 are examples of this. Many relationships among SMs were 
more subtle. Two simulation studies were run to identify 
a) pairs of SMs which are monotonic even though this might not 
be obvious through algebraic comparison, and b) clusters of SMs 
which tend to produce similar rank orderings of documents. 

The first simulation used binary information about the 
presence of terms in documents (i.e. TW No. 1). This limited 
the conclusions that could be drawn from this study, but simplified 
it considerably. An aim of this simulation was to decrease the 
number of SMs considered to be unique. Specifically, if a measure 
which was intended for use on binary data was found to produce 
identical or similar rank orderings to another binary SM or to 
an SM intended for weighted vectors, then a binary SM could be 
dropped from further analysis. In the first case, there is not 
sufficient reason for translating an SM beyond its intended 
application if it is not even unique within its own sphere. In 
the second case, the binary SM is found to be a special case of 
a more general SM. 

The simulated products of these two simulation studies 
were the document orderings that would be produced by the various 
SMs. For the first simulation, an artificial binary term-by-document 
matrix was constructed. This matrix consisted of a 
simulated sample of fifty "documents" and fifty "terms". The 
presence/absence data was created using random numbers. The 
numbers were drawn from distributions approximately the same as 
those observed in the free text DR of the CIJE data base. Thus, 
the parameters describing the indexing breadth and depth of the 
simulated sample approximate those of the CIJE data base. 

The first simulation consists of submitting "queries" 
(artificially constructed to have the same distribution of words 
as would be found normally), and determining the rank orderings 
of the fifty documents that would be produced by the various SMs. 
These orderings were compared by computing rank order correlations 
between the ordered lists produced by each pair of SMs. These 
correlations were averaged over fifteen simulated queries. 

The pattern of correlations can be seen in the graph 
below (Figure 4). 




CORRELATIONS BETWEEN SMs 
BASED ON BINARY SIMULATION OF 15 QUERIES 



FIGURE 4 






( 2^2b,29,32 )^ 
(7, 14, 31, 43 >62, 63, 64, 66, 6?) 





* Each set of circled number(s) represents one unique SM. 
Edges represent correlations > .7. 
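
The comparison procedure of the first simulation (rank the fifty simulated documents under each SM, correlate the two orderings, average over the simulated queries) might be sketched as follows. This is a hedged illustration only: the two measures shown, Jaccard and Dice, are well-known binary coefficients chosen for the example and are not tied to the numbering of Table 2; the term probabilities 0.2 and 0.1 are arbitrary stand-ins for the CIJE distributions; all function names are hypothetical.

```python
import random
from statistics import mean


def counts(doc, query):
    """Return (a, b, c): terms in both, in doc only, in query only."""
    a = sum(1 for x, y in zip(doc, query) if x and y)
    b = sum(1 for x, y in zip(doc, query) if x and not y)
    c = sum(1 for x, y in zip(doc, query) if y and not x)
    return a, b, c


def jaccard(a, b, c):
    return a / (a + b + c) if (a + b + c) else 0.0


def dice(a, b, c):
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0


def average_ranks(scores):
    """Rank 1 is the highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(scores1, scores2):
    """Rank order correlation: Pearson correlation of the tie-averaged ranks."""
    r1, r2 = average_ranks(scores1), average_ranks(scores2)
    m1, m2 = mean(r1), mean(r2)
    cov = sum((p - m1) * (q - m2) for p, q in zip(r1, r2))
    v1 = sum((p - m1) ** 2 for p in r1)
    v2 = sum((q - m2) ** 2 for q in r2)
    if v1 == 0 or v2 == 0:
        return 0.0  # assumption: an ordering with no variation counts as uncorrelated
    return cov / (v1 * v2) ** 0.5


def mean_rank_correlation(docs, queries, sm1, sm2):
    """Average the rank correlation between two SMs over all queries."""
    return mean(
        spearman([sm1(*counts(d, q)) for d in docs],
                 [sm2(*counts(d, q)) for d in docs])
        for q in queries
    )


rng = random.Random(0)
docs = [[int(rng.random() < 0.2) for _ in range(50)] for _ in range(50)]
queries = [[int(rng.random() < 0.1) for _ in range(50)] for _ in range(15)]
print(mean_rank_correlation(docs, queries, jaccard, dice))
```

Because Dice is a monotone function of Jaccard, the two measures order the documents identically for every query whose scores vary at all, which is the kind of "joint monotonicity" the simulation was designed to detect without algebraic comparison.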




These correlations depart slightly from the central aim 
of the study. They are based on a ranking of all fifty 
(simulated) sample documents, whereas the main study is only 
concerned with ranking documents which fulfill the logic of a 
query. In practice this means ranking documents that have at 
least one term in common with the query. Some SMs differed in 
this simulation only on documents which would not normally be 
retrieved; in these cases, the SMs were grouped as equivalent. 
The effect of this difference is that the correlations are 
generally lower than they would be under retrieved set ranking. 

Some SMs (Nos. 6, 26, 48, 49, 50, 51, 58) had 
conspicuously low (near zero) correlations with all other SMs. 
As a result of the analyses to this point, twenty-nine unique 
SM types may be described. These are listed in Table 3. Some 
SMs listed as unique were found to have very high or perfect 
correlations with other SMs in the binary simulation. SMs 
designed for weighted vectors were retained because they might 
differ more when weights are used. Note that SMs such as Nos. 
40, 44, 12 and 4 (5, 53) are considered unique at this point, 
despite high correlations, because of potential differences on 
weighted vectors. 



UNIQUE SMs AFTER BINARY SIMULATION 



TABLE 3 

_ (Others Monotonic With It) 



1 
2 
4 
6 

7 

11 
12 
26 
27 
28 
29 
30 
31 
32 
40 
44 
46 
48 
49 
50 
51 
52 
54 
55 
56 
58 
63 
64 



(13, 19, 22, 24, 25, 38, 41, 47) 
(3, 21, 22b) 
(5, 53) 

(8, 9, 10, 14, 43, 66, 67) 
(39, 42, 45) 



(15, 16, 17, 20, 33, 34, 35, 37, 6 



(18) 

(59, 62) 
(57) 



* Found monotonic with No. 55 in weighted simulation 




In Table 3, the SMs presented outside the parentheses 
are either unique, or represent several SMs which are 
equivalent but, as a group, are unique from other SMs. The 
SMs representing groups were selected on the basis of 
generality (general over Boolean) and computational simplicity. 

A second simulation study was performed using frequency 
information (TW No. 6) instead of binary information (TW No. 
1). The frequency information was added to the fifty-by-fifty 
matrix used previously in such a way as to model 
the term frequency distributions of the CIJE data base. 
Again, the correlations between the document orderings created 
by the SMs were averaged over fifteen simulated queries. 

This simulation study looked for relationships among 
the heretofore unique SMs. The correlations obtained in 
this study were much lower; very few were as great as 0.4. 
Based on these low correlations, we may conclude that the 
document orderings produced by these SMs were different 
from each other. 

Two relationships were observed in the second simulation 
which helped select a sample for later experimentation. 
SM No. 5, which was not included in the first simulation, 
was found equivalent to No. 31. Also, SMs No. 12 and No. 44 
again displayed a very strong correlation (> .9). 




This left twenty-eight SMs in the sample. For reasons 
such as high correlation or computational complexity, some 
of these SMs were excluded from the sample used for further 
experimentation, bringing the sample size to twenty-four. 
Inclusion and reasons for exclusion are noted in Table 2. 



DESCRIPTION AND LOADING 
OF THE DATA BASE 

INTRODUCTION 

This section describes the data base and each of the 
document representations used for this study. The data base 
is a subset of the ERIC CIJE (Current Index to Journals in 
Education). The document representations chosen were (1) 
terms from the title and annotation, and (2) the ERIC 
descriptors for each document. These representations were 
selected as representatives of those available in commercial 
data bases. 

Additionally, this report presents a comparison of the 
CPU use and storage requirements for the loading of the data 
base. These are the computer costs for construction of the 
dictionary to make the system available online. None of the 
computer or labor costs involved in the construction of the 
data base are included. 

Description of Data Base 

The data base for this project consisted of 10885 
records from CIJE. The selected records are from four 
clearinghouses: Tests, Measurement and Evaluation (TM), 
Information Resources (IR), Educational Management (EA), and 
Teacher Education (SP). At the time of acquisition (August 
1978) these were the most recent records available from ERIC 
in each of the clearinghouses. 

The distribution of the records among the clearinghouses 
is: 

Clearinghouse    % of Total    Records 
EA               31.8          3461 
IR               26.9          2928 
SP               28.8          3135 
TM               12.5          1361 

TABLE 4 

RECORDS IN DATA BASE BY CLEARINGHOUSE 

These reflect the proportion of the ERIC CIJE data base 
developed by each of the clearinghouses. Records were selected 
by identifying those developed by the four clearinghouses 
over the period of the previous 24 months. No other selection 
criteria were used. 

Each of the document representations, controlled 
descriptors and free text, was used to create a separate inverted 
file. The free terms, from title and abstracts, were compared 
to a stop list containing about 150 common terms and then the 
remaining terms were stemmed (TARS, 1976). The stemming 
algorithm reduces the number of free terms. This in turn 
reduces the need to identify all word variations for 
retrieval. Controlled descriptors were used as developed by 
the ERIC professionals. For purposes of system compatibility 
embedded blanks were removed. Each controlled descriptor was 
truncated at 24 characters. This ensured the uniqueness of 
each descriptor. This same process was conducted at the time 
of the search and was transparent to the intermediary. 
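
The descriptor normalization described above (remove embedded blanks, then truncate at 24 characters) can be sketched in a few lines. The function names are hypothetical and the sample descriptors are illustrative only; the collision check is an added convenience for verifying, on any given vocabulary, that truncation keeps descriptors distinct.

```python
def normalize_descriptor(descriptor: str, max_len: int = 24) -> str:
    """Remove embedded blanks, then truncate to max_len characters."""
    return descriptor.replace(" ", "")[:max_len]


def find_collisions(descriptors):
    """Return pairs of distinct descriptors that normalize to the same key."""
    seen = {}
    collisions = []
    for d in descriptors:
        key = normalize_descriptor(d)
        if key in seen and seen[key] != d:
            collisions.append((seen[key], d))
        else:
            seen.setdefault(key, d)
    return collisions


# Illustrative descriptors (not taken from the study's data)
sample = ["Information Retrieval", "Educational Administration Research Methods"]
for s in sample:
    print(normalize_descriptor(s))
print(find_collisions(sample))
```

Applying the same function to the searcher's input at query time is what makes the step transparent to the intermediary: both sides of the match pass through an identical transformation.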



Statistics describing the characteristics of the data base 
for the controlled and free terms are presented in Table 5. 



                                              Controlled    Free 
Average number of terms in a record           6.45          20.39 
Average number of unique terms in a record    6.45          16.77 
Average number of postings in a term          18.21         17.62 
Number of unique terms                        3855          10361 

TABLE 5 

DESCRIPTION OF DATA BASE BY REPRESENTATION 



Construction of the Inverted Files 

As noted earlier, an inverted file was constructed for 
both the controlled descriptors and free text document 
representations. These files were constructed using SIRE 
(Syracuse Information Retrieval System). SIRE (McGill et al, 
1976) was developed at the School of Information Studies for 
experimental use. This section will explain the process used 
in constructing the inverted files. 



SIRE uses a three-step process in building the inverted 
file. 

1. Each record is processed sequentially, producing a 
dictionary of the terms in that record. The output 
from this step is a file in which each record consists 
of a term, frequency of the term in document, and document 
number. Other small files are produced at this point 
which contain pointers and document length. The SIRE 
program which accomplishes this is READIN (See Figure 5). 

2. The next process is to sort the file produced by READIN 
into alphabetic order on the term field. This step is 
accomplished by two programs, SORT and MERGE (See 
Figure 6). 

3. The final step is to use the sorted file to produce an 
inverted file. The program to accomplish this is 
MAKDIC (See Figure 7). 
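
The three steps above can be sketched as a small in-memory pipeline. This is a minimal illustration under stated assumptions, not SIRE itself: the Python functions `readin`, `sort_records` and `makdic` are hypothetical stand-ins named after the SIRE programs, documents are assumed to be lists of already-processed terms, and the pointer and document-length files are omitted.

```python
from collections import Counter, defaultdict


def readin(documents):
    """Step 1: emit one (term, freq, doc_no) record per distinct term per document.

    Records come out in increasing document number and, within a document,
    in alphabetic order on term.
    """
    records = []
    for doc_no, terms in enumerate(documents, start=1):
        for term, freq in sorted(Counter(terms).items()):
            records.append((term, freq, doc_no))
    return records


def sort_records(records):
    """Step 2: order the records on the term field.

    Python's sort is stable, so document order is preserved within a term.
    """
    return sorted(records, key=lambda r: r[0])


def makdic(sorted_records):
    """Step 3: group the sorted records into term -> [(doc_no, freq), ...]."""
    inverted = defaultdict(list)
    for term, freq, doc_no in sorted_records:
        inverted[term].append((doc_no, freq))
    return dict(inverted)


# Two illustrative "documents" (not from the study's data)
docs = [["rank", "term", "rank"], ["term", "query"]]
inverted_file = makdic(sort_records(readin(docs)))
print(inverted_file)
```

Keeping the sort as a separate step mirrors the batch design described in the text, where the intermediate file is too large to group in memory and is ordered by an external sort and merge.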


These processes were used to construct the inverted file 
for the free terms. Modification of the process was required 
for the controlled terms since SIRE was designed to handle 
individual terms with a length of up to 12 characters. Since 
controlled terms were often phrases or combinations of words, 
more than 12 characters were needed to assure that each 
controlled term had a unique representation. This required 
inserting a new step between READIN and MERGE/SORT which 
converted the controlled terms into codes and constructed a 
conversion table. This conversion program is called 
CONNUM. 

Comparison of Construction Times 

The CPU usage and total cost for each of the programs 
used in the construction of the inverted file are provided in 
Table 6. These figures are most useful for comparison of 
relative costs of the two document representations. Actual 
costs are dependent on the particular configuration and 
characteristics of the computer installation. In particular, 
the cost of this SORT procedure is inflated since the SORT 
and MERGE were written locally and are less efficient than 
commercially available packages. All programs are written in 
SAIL (Stanford Artificial Intelligence Language), an ALGOL-60 
variant, on a DEC-10. 

The comparison of the controlled and free text shows 
that the controlled is less expensive to build and store. 
Specifically, only 60% of the CPU time and 90% of the space 
to store the dictionary were used for the controlled descriptors 
as compared to the free text. 

The controlled vocabulary is a less expensive 
representation, in terms of computer usage, than is the free text. 
Computer costs for building and maintaining a document 
representation based on free text would be greater than those 
for a document representation based on controlled terms. 



            Controlled    Free 
READIN      210           460.33 
CONNUM      267           - 
SORT        264.58        667.16 
MERGE       44.69         169.87 
MAKDIC      7.66          16.97 

Total       793.93        1314.33 

(CPU time in seconds) 

TABLE 6 

PROCESSING TIMES FOR FILE CONSTRUCTION 



File Sizes (36-bit Words) 

              Variable    Fixed        Total 
Controlled    85,632      1,105,430    1,191,062 
Free          213,632     1,105,430    1,319,062 

TABLE 7 
COMPARISON OF FILE SIZES 

The fixed file size includes pointer files and the 
CIJE data base source. The variable file size is the inverted 
file, which will be different for the different representations. 



FIGURE 5. OUTPUT FROM READIN 

[Diagram: a sequence of records, each holding Term, Freq and 
Document number, for Document 1 through Document N.] 

File is in order of increasing document number. Within a 
document each term occurs only once and records are in 
alphabetic order on terms. 

FIGURE 6. OUTPUT FROM MERGE 

[Diagram: a sequence of records, each holding Term, Freq and 
Document No.] 

File from Figure 5 has been sorted into increasing 
order on term. 

FIGURE 7. OUTPUT FROM MAKDIC 

[Diagram: an Index whose entries hold Term, No. of Terms and 
Pointer, pointing into an Inverted File whose entries hold 
Term, Postings, Doc. No. and Freq.] 

This is the inverted file which SIRE used for retrieval. 

COLLECTING INTEREST STATEMENTS 

The Current Index to Journals in Education is a data 
base with a broad group of potential users. By selecting the 
clearinghouses on Tests, Measurements, and Evaluation, 
Educational Management, Information Resources, and Teacher 
Education, the group of potential users was narrowed 
considerably. The final users were from Syracuse University, 
Cornell University, and the local Syracuse geographic area. 
They included students, faculty, and local professionals. 
In order to assure the users of complete anonymity, no 
specific demographic data were collected. 

Users were individuals with actual information 
requirements. A pseudo-service was established and appropriate 
announcements were made of its availability in classrooms, 
through mailings, and by word of mouth. A copy of the 
announcement flier is included in Appendix B. Information 
request statements were collected on the request forms 
included in Appendix B. The forms were acquired from 25 
October, 1978 through 15 February, 1979. A total of one 
hundred seventy-three information request statements were 
received, searched, and sent back to the user for relevance 
judgments. One hundred forty were returned with completed 
relevance judgments for a response rate of 80.9%. 

The study required a comparison of representations and 
a measure of the system's ability to rank relevant documents 
within a retrieved set. If a specific retrieval set contains 
only relevant documents, then one is unable to measure an 
algorithm's ability to place relevant documents before 
non-relevant documents. The same is true for a search which 
retrieves no relevant documents. Thus for the purpose of 
this study, in order for a query to be useable, one relevant 
and one non-relevant document had to be retrieved from each 
representation. Of the 140 searches which were returned with 
relevance judgments, 68 had at least one relevant and one 
non-relevant document in each representation. Thus, 48.5% of 
the completed searches with relevance judgments were useable 
for the study of ranking algorithms. 

A sample of the returned output along with the 
instructions to users for relevance judgments is included 
in Appendix B. 

INTERMEDIARIES AND SEARCHING 

The three intermediaries were selected because they are 
professional searchers. At the beginning of the study, each 
intermediary had been searching the entire ERIC data base for 
at least one year. The intermediaries were given a brief 
training session on the use of the SIRE system and a one-page 
description of the appropriate commands for this study 
(Appendix A). These instructions did not include any 
description of SIRE's ranking capability nor any of the natural 
language features for searching. Thus the searchers were kept 
unaware of the goal of the study and were unable to use 
techniques which might have contaminated the study. Searchers were 
instructed to perform high recall searches. 

Each information request form was duplicated so that it 
could be sent to one intermediary for searching in the 
controlled vocabulary and to one for searching in the free term 
vocabulary. The information requests were assigned randomly 
to the intermediaries. Each intermediary was instructed to 
conduct a search in the appropriate vocabulary, controlled or 
free. By random selection, sixty-eight information requests 
were searched by the same intermediary in both the free and 
the controlled vocabulary. Each of the remaining one hundred 
five was searched by different intermediaries in each of the 
vocabularies. Twenty-three queries in the free representation 
and twenty-seven queries in the controlled representation were 
randomly selected for reliability checks. Each of these 
information request statements was submitted to all of the 
intermediaries for a consistency check on performance. The 
documents retrieved by the originally designated intermediary 
were returned to the user for relevance judgments. Documents 
retrieved by the remaining two intermediaries were used for an 
examination of the overlap of the document sets. 

The output forms returned to users for relevance 
judgments were limited to fifty documents. The thirty-three 
queries that retrieved more than fifty documents were reduced 
to fifty by randomly selecting from those documents in the 
full retrieved set. The retrieved output was placed in a 
random order prior to return to the user to control for any 
order effect. A statistical test for order effect was 
conducted. For all practical purposes, no correlation was 
found. The correlation between position on the list and a 
positive relevance judgment was .01. 

In most cases retrieved output was hand delivered to the 
user. Some output was mailed and in some instances it was 
picked up directly from the office. A form letter was developed 
requesting the return of the evaluated output after a reasonable 
period of time had passed and the relevance judgment had not 
been returned. This is included in Appendix B. This proved 
to be an effective means of increasing the return. Sixty-two 
reminder letters were sent out, resulting in the return of 
thirty-nine evaluated output forms; that is, 62.9% of the 
reminders resulted in returned forms. 

Users were asked to evaluate each retrieved reference 
on a scale of 1 to 4, where 1 indicated direct relevance to 
the information request, and 4 indicated no relevance to the 
information request. 

The data required for this study were the output documents 
and the associated relevance judgments. Each information request 
was kept as an independent unit. For each request, data were 
captured indicating the documents retrieved, the relevant 
documents, the non-relevant documents, and any documents which 
were not returned to the users. Relevance information was 
stored as the original 1 to 4 values assigned by the user. 
However, for the study this information was dichotomized to 
indicate the relevance and non-relevance of a document to a 
particular request. 



QUERY PROCESSING 

Query processing included the acquisition of interest 
statements, clerical procedures prior to sending the information 
request to intermediaries, the actual processing (searching) 
by the intermediaries, computer programs to collect and 
rearrange the references, preparation of the output and judgment 
information, delivery of the references to the users for 
evaluation, and return to the project office for input. The 
results of this entire process were the data required for an 
analysis of the ranking algorithms. Thus, the key factors 
are the requests and the characteristics of the output 
developed from the requests. 

To characterize the requests and the output, one begins 
with the form of the requests. Each user was requested to 
submit a two or three sentence statement in plain English 
describing their information need. In fact, most requests 
were two or three sentences, with the outliers ranging from 
a few descriptive words (as if selected from a controlled 
vocabulary) to as many as ten sentences. Each information 
request form was delivered to the appropriate randomly 
selected intermediary. There was no direct contact between 
the intermediary and the user. Each request was processed 
according to the previous instructions and when the 
intermediary was satisfied with the output, the search was 
terminated. No controls were put on the length of the search, 
either in time or number of commands. 

The data in Table 8 show the operational 
characteristics of the search process. The average number of references 
retrieved in response to an information request was 18.7. This 
ranged from searches which retrieved 0 items to those which 
retrieved 170 items. 

In the free representation, 17 of the 55 useable queries 
retrieved more than 50 documents, or 63.6% of the queries 
retrieved 50 or fewer documents. 80% retrieved 55 or fewer 
documents and 89.1% retrieved 72 or fewer documents. The 
controlled representation shows there were 68 useable queries 
with 35 or 51.5% retrieving 50 or fewer documents. 80.9% 
retrieved 72 or fewer documents and 89.7% retrieved 103 or 
fewer documents. There is a major difference between one 
searcher and the other two searchers in terms of the number of 
items retrieved. Searchers A and B retrieved an average of 
24.3 documents per query while Searcher C retrieved an average 
of 7.3. Thus Searcher C retrieved 69.9% fewer documents on 
the average than either Searcher A or B. 



                                   Difference            Difference               Difference 
                 Number            Between               in Relevant              in Precision 
                 of      Average   Controlled  Average   Between                  Between 
                 Searches Retrieved and Free   Relevant  Controlled   Precision   Controlled 
                                                         and Free                 and Free 

A  Free          27      25.1                  7.6                    .37 
   Controlled    25      19.8      +5.3        7.0       +0.6         .41         -.04 
   Total         52      22.6                  7.3                    .39 

B  Free          28      29                    8.4                    .34 
   Controlled    26      22.7      +6.3        9.1       -0.7         .55         -.21 
   Total         54      26                    8.7                    .42 

C  Free          25      7.3                   2.7                    .40 
   Controlled    27      7.3       0.0         3.7       -1.0         .54         -.14 
   Total         52      7.3                   3.2                    .47 

TABLE 8 

SUMMARY OF SEARCH CHARACTERISTICS 



On the other hand, the total precision performance figures 
vary by only 8%. The conclusion is that there is a significant 
difference in the documents retrieved by the intermediaries, but 
the difference in performance measures is slight. 

The searchers differed significantly in their performance 
when examined by representation. Searchers A and B clearly 
retrieved more documents in the free representation than in the 
controlled. Searcher C performed identically in both 
representations. However, the number of relevant retrieved documents 
shows that searchers B and C were able to use the controlled 
representation more effectively. 

The lack of agreement among documents retrieved 
across representations and searchers clearly does not affect the 
precision achieved by the intermediaries. Precision ranged from 
.34 to .54. Within the free representation, the observed 
precision ranged from .34 to .40. Within the controlled 
representation precision ranged from .41 to .54. The controlled 
representation provided consistently better precision in this study 
with a searcher difference between representations ranging from 
.04 to .21. One a priori factor constant among the intermediaries 
is their previous experience with the ERIC controlled 
vocabulary. This may influence the direction and/or the magnitude 
of the observed data. 

To examine differences in the documents retrieved by 
searchers, an overlap study was conducted using the 33 queries 
identified for the reliability data. The results of this 
study are shown in Table 9. 



                                        SEARCHER 
Representation                       Same      Different 
Same                                  -         9% 
Different (Free and Controlled)      14%        5% 

TABLE 9 

OVERLAP PERCENTAGES 



The observed overlap is very small; in fact, these 
figures could be explained by chance. By chance is meant 
that the representations are independent with respect to their 
descriptions of relevant documents and their descriptions of 
non-relevant documents. That is, both representations may 
perform similarly in discriminating the relevant from the 
non-relevant documents. However, within relevant or non-relevant 
subsets there is no relationship between the way the documents 
are described. It is as if retrieval using either representation 
was done by randomly sampling from the same sets of relevant and 
non-relevant documents. Thus different sets of documents (but 
the same relevant/non-relevant percentages) are retrieved by 
each. Another explanation may be that the representations are 
systematically different. That is, a searcher knowing that one 
or the other representation is being used will systematically 
retrieve different documents. The data from this study do 
not allow for an in-depth examination of these findings. Further 
data focused on this topic are required for a complete 
understanding of these preliminary findings. 



The complete records of searches were retained, thus 
allowing exploratory analyses to be pursued. Two of the more 
interesting features were discovered by looking for factors 
which correlated with precision. Neither the number of "OR" 
operators nor the number of "AND" operators correlated 
significantly with the precision measure. However, the ratio 
of "OR" operators to "AND" operators does correlate positively. 
Specifically: • or 

« 

In other words, there is a positive relationship between the 
number of "OR" operators relative to the number of "AND" 
operators and precision. The intermediaries appear to begin 
searches by developing concept classes by linking terms 
together with "OR" operators. Once these concept classes have 
been established, they are linked together by "AND" operators 
for retrieval. It appears that the greater the development 
of these concept classes (word groups connected by "OR" 
operators) which are connected by "AND" operators, the higher 
the precision of the search will be. 

In another analysis, it was found that the more display 
operators the intermediary used, the lower the precision value 
would be. The correlation between the display operators and 
the precision was -.45. In this case, it may be that the less 
sure the intermediary was of their search strategy, the more 
often the person would take time to display retrieved references. 
The result is that the display commands would be a measure of 
intermediary uncertainty, which was reflected in the precision 
of the search. While studies of this sort are interesting 
and will be ongoing, they are not central to the understanding 
of ranking algorithms. 

Lost Responses 

The evaluated output form from the query processing 
was usually returned to the project office. The return rate 
was 69.9%. The 30.1% that were not returned were primarily 
for the users' personal reasons. However, three queries were 
lost due to intermediary input errors and computer hardware 
problems. In both situations timely return to the user 
was made impossible. 



A software problem caused the loss of forty-five
searches from the free representation, but not from the
controlled representation. The initial 128 queries were
processed normally. Subsequent documents retrieved by the
free representation were incorrectly retained and these were
not delivered correctly to the users for evaluation. The
data were unrecoverable. Thus the data from the free
vocabulary tests reflect 128 information requests.



MEASURING THE
EFFECTIVENESS OF RANKING

The method of evaluation of this study is the coefficient
of ranking effectiveness (CRE) and an analysis of the factors
affecting the cost of ranking algorithms.

Cooper suggested that the essential function of a retrieval
system is not to divide the data base into retrieved versus
non-retrieved sets, but rather to establish an ordering among
documents based on their relationships to the query. He
proposed to measure the effectiveness of a system by its ability
to rank order documents, placing relevant documents near the
beginning of the list (Cooper, 1968). His measure is the
proportional reduction in the number of non-relevant documents
that have to be looked at before the query is satisfied, over
the number that would have had to be examined if the documents
were arranged randomly. Unfortunately, this measure requires
knowledge of all the relevant documents in a data base.

An approximation to this measure has been developed which
can be computed from a sample without total knowledge of the
data base. This measure is not considered to reflect a general
assessment of the retrieval system's performance, but rather,
it measures specifically the effectiveness of the order in
which the output is ranked as opposed to unranked (randomly
ordered) output. The Coefficient of Ranking Effectiveness
(CRE) is defined as

\[ \mathrm{CRE} = \frac{m_r - R}{m_r - m_p} \]

where $m_r$ is the expected mean rank of the relevant documents
retrieved if the output list is randomly ordered, $m_p$ is the
expected mean rank of the relevant documents retrieved if the
output list is perfectly ordered, and $R$ is the observed mean
rank of the relevant documents retrieved. $(m_r - m_p)$ is the
distance between the expected mean rank of the relevant
documents on a randomly ordered list and their expected mean
rank on a perfectly ordered list. The CRE measures the
proportion of this distance that is accounted for by the
observed mean rank.

If the relevant documents are randomly dispersed throughout
the list, then their expected mean rank (over time) will be
the same as the mean rank of all the documents. The mean of
the numbers one through $n$ (for $n$ documents on the list)
equals $(n+1)/2$.

If there are $k$ relevant documents and they are at the
top of the list, then their mean rank would be $(k+1)/2$.
By substitution,

\[ \mathrm{CRE} = \frac{m_r - R}{m_r - m_p}
 = \frac{(n+1)/2 - R}{(n+1)/2 - (k+1)/2}
 = \frac{n+1-2R}{n+1-(k+1)}
 = \frac{n+1-2R}{n-k} \]

CRE is computed per search. To obtain a mean CRE for a system's
performance over a number of searches (for example K searches):

\[ \overline{\mathrm{CRE}} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{CRE}_i \]

CRE ranges from one through negative one, one being perfect
ranking, and zero representing random dispersion. A score below
zero indicates that the system is performing worse than chance.
This means that the relevant documents have a low score, and in
order to correct for this, the system would just sort the list
low to high instead of high to low.
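The computation of CRE for a single search, and of the mean over several searches, can be sketched as follows. This is a minimal illustration in Python, assuming 1-based rank positions for the relevant documents; the function names `cre` and `mean_cre` are chosen here for illustration and are not taken from the study's software.

```python
def cre(relevant_ranks, n):
    """Coefficient of ranking effectiveness for one search.

    relevant_ranks: 1-based positions of the relevant documents in the output list
    n: total number of documents on the output list
    """
    k = len(relevant_ranks)
    if k == 0 or k == n:
        # With no relevant documents, or only relevant documents,
        # the random and perfect mean ranks coincide and CRE is undefined.
        raise ValueError("CRE is undefined when k = 0 or k = n")
    observed_mean_rank = sum(relevant_ranks) / k
    return (n + 1 - 2 * observed_mean_rank) / (n - k)


def mean_cre(values):
    """Mean CRE over a number of searches."""
    return sum(values) / len(values)


perfect = cre([1, 2, 3], 10)      # relevant documents at the top of the list
reversed_ = cre([8, 9, 10], 10)   # relevant documents at the bottom of the list
chance = cre([5, 6], 10)          # mean rank equals (n + 1) / 2
print(perfect, reversed_, chance, mean_cre([perfect, reversed_, chance]))
```

The three calls correspond to the three reference points named in the text: perfect ranking, worst-case ranking, and a mean rank equal to that expected from random dispersion.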

From its definitional formula it can be seen that CRE is
interpretable as percent of error accounted for; it represents
the percentage of the total possible improvement (from random
to perfect) that has been realized by the system. Since CRE
is a linear transform of the mean rank and $\overline{\mathrm{CRE}}$
is a mean of CREs, $\overline{\mathrm{CRE}}$ can be expected to be
normally distributed as the number of searches on which it is
based increases. [(The expected value of
$\overline{\mathrm{CRE}}$) = (the expected value of CRE) = 0.]

CRE is relatively insensitive to the density of relevant
documents in either the data base or the retrieval set.
Compared to Cooper's measure, it does not require knowledge
of all relevant documents in the data base or knowledge of a
specific number of documents the user feels will satisfy his
query (Cooper, 1968). Another possible measure is Salton's
normalized precision (Salton, 1968). Studies show that it is
sensitive to the density of relevant documents in the retrieval
set and the appearance of non-relevant documents after the
location of the last relevant document, and that it is very
sensitive to early occurrences of relevant documents. Hence,
it measures something other than is desired in our evaluation,
which is aimed at the ability to rank already defined sets.
Finally, CRE has an intuitively appealing linearity: a score
of 0.5 indicates the mean rank of relevant documents on the
list is halfway between what would be expected by chance and
what would constitute perfect performance.

Recall measures will not be computed in this study for a
number of reasons. One reason is that users do not always want
to see all of the relevant documents (Cole and McGill, 1977).
Other reasons are (1) the belief that relevance is not an
absolute assessment, but rather is a perception of the user,
and (2) recall taps aspects of a system's performance which
are not within the scope of this study (e.g., coverage of the
collection).

The evaluation required that the relevance data be
available and that certain document, collection, and term
information be available for analysis. The relevance data was
kept in a file organized by search, and then by document
retrieved. The ranking algorithms could then be executed and
the results compared to the document relevance information.
The coefficient of ranking effectiveness and its standard
error were calculated directly from this information.

The SIRE system automatically keeps type, token, and sum
of squares information for terms in documents. That is, for each
document, the information about the number of unique word
stems, the number of word stems, the frequency of each stem,
and the sum of the squares of the frequencies of each term in
the document is stored. This is explained in detail in McGill
et al. Document length and collection information were also
necessary for this study. This information was captured at
the time the document was input to the system and retained
for use with the similarity measures. The similarity measures
were calculated for each document potentially relevant to a
query. The coefficient of ranking effectiveness and
associated descriptive information was immediately available
and stored for comparative purposes. The procedure is
presented in Figure 8.
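The per-document statistics listed above (types, tokens, stem frequencies, and the sum of squared frequencies) can be illustrated with a short Python sketch. It assumes a document already reduced to a list of word stems; the function name `document_statistics` and the dictionary keys are hypothetical and are not SIRE's own data layout.

```python
from collections import Counter


def document_statistics(stems):
    """Summarize one document given its sequence of word stems."""
    frequencies = Counter(stems)
    return {
        "types": len(frequencies),                 # number of unique word stems
        "tokens": sum(frequencies.values()),       # total number of word stems
        "frequencies": dict(frequencies),          # frequency of each stem
        "sum_of_squares": sum(f * f for f in frequencies.values()),
    }


stats = document_statistics(["rank", "rank", "term", "weight", "rank"])
print(stats)
```

Keeping these summaries at input time means a similarity measure that needs document length or a sum of squares does not have to rescan the document text.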



FIGURE 8

CALCULATION OF THE COEFFICIENT
OF RANKING EFFECTIVENESS

[Flowchart. Legible box labels: Queries; Data File (information
from the project); Invert File (produced by SIRE with frequency
data); Term Weights (weighted vector for each document to each
query); Document Length (contains types and tokens; produces
document length for particular term weighting scheme); Output
from READIN (contains dictionary for each document, frequency);
Similarity Measures (contains similarity measure of each
document to each query); Relevance Judgments (contains
relevance judgment for each document to query); Calculate CRE;
CRE File (CRE, means of relevant and non-relevant, and SD).]

OVERVIEW OF RESULTS 

The specific results of this study will be presented
in the following two sections. These results indicate little
or no differences among term weighting schemes in the
controlled vocabulary. This is, of course, an expected outcome.

There are significant differences among the similarity
measures. There are eighteen measures within two standard
errors of the maximum observed value. Thus, while clear
differences do exist, there is a class of measures which are
not statistically distinguishable. Within the free
representation, classes of term weighting and similarity
measures are identified which perform significantly better
than others. Within classes, no distinction is possible.
However, additional cost information is provided to assist in
the selection of a ranking algorithm. The cost data are
concerned with incremental and relative costs associated
with processing and storage. A person using this report
is advised to include a consideration of term weighting,
similarity measure, and cost.

There are clearly some schemes which are not useful.
But of the effective schemes, there is little basis for
selecting one scheme over another. Thus, one will be well
advised to use simple but effective schemes.




The CRE values for the controlled representation are
presented in Table 10. Each cell value is a mean of 68
individual CRE values obtained from the useable queries for
this study.




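TABLE 10
MATRIX OF CRE VALUES
CONTROLLED REPRESENTATION

[Matrix of mean CRE values. Rows: similarity measure number.
Columns: term weighting scheme number. Row and column means
are included. The cell values are illegible in this copy.]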



The analysis of the term weighting schemes is presented
in Table 11.

Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Squares     F

AMONG             13         0.027      0.0021      0.55
WITHIN           322         1.2        0.0038
TOTAL            335         1.3

TABLE 11
ANALYSIS OF VARIANCE RESULTS
CONTROLLED REPRESENTATION
TERM WEIGHTING SCHEMES

This analysis fails to indicate any significant
difference among the term weighting schemes. This is not
surprising since the weights for controlled representations
are determined by using dichotomous information about the
presence or absence of a term. Thus, evaluation of
differences is limited to the comparisons of similarity measures.
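The F ratio reported for the term weighting schemes can be checked from the sums of squares and degrees of freedom printed in Table 11. The following Python sketch is only an illustration of the one-way analysis of variance arithmetic; the function name `f_ratio` is an assumption made here, and the inputs are the rounded values from the table, so the result agrees with the tabled F only to within rounding.

```python
def f_ratio(ss_among, df_among, ss_within, df_within):
    """F ratio for a one-way analysis of variance: mean square among / mean square within."""
    ms_among = ss_among / df_among
    ms_within = ss_within / df_within
    return ms_among / ms_within


# Rounded values as printed in Table 11 (controlled representation, term weighting schemes).
f_term_weighting = f_ratio(ss_among=0.027, df_among=13, ss_within=1.2, df_within=322)
print(f_term_weighting)
```

An F ratio below one, as here, means the variation among term weighting schemes is smaller than the variation within them, which is why no significant difference is indicated.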

The values for each cell were examined for differences
by similarity measure and for differences by term weighting
scheme using one way analysis of variance. The results of
the analysis by similarity measures across term weighting
schemes are presented in Table 12. A significant difference
at the .01 level is indicated. Thus, similarity measures can
be selected which will give significantly better performance
on the average than others.



Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Squares     F

AMONG             23         0.89       0.039       33
WITHIN           312         0.36       0.0012
TOTAL            335         1.3

TABLE 12
SIMILARITY MEASURE
ANALYSIS OF VARIANCE RESULTS
CONTROLLED REPRESENTATION

In a controlled vocabulary environment, these results
indicate that the selection of a term weighting scheme
is not an important consideration. The selection of a
similarity measure is an important consideration. In order
to clarify the selection from among the similarity measures, it
is necessary to look for equivalence and disparities in
performance.

Tukey's HSD (Honestly Significant Difference) was used
to determine significant distinctions between pairs of mean
values from the similarity measures (Kirk, 1968). Table 13
shows the differences among means.



                        Similarity        Significantly    Significantly
          Frequency     Measure           Different From   Different From
          of            Number From       Means            Means
Mean      Occurrence    Table 2           Greater Than     Less Than

-.075         1         29                All Others       None Lower
 .0036        1         50                .0516            -.0404
 .0071        1         Random            .0551            -.0409
 .049         1         6                 .097             .001
 .054         1         26                .102             .006
 .056         1         58                .104             .008
 .075         1         31                .123             .027
 .083         1         32                .131             .035
 .090         1         7                 .138             .042
 .091         2         27, 56            .139             .043
 .11          4         12, 44, 63, 64    None Higher      .062
 .12          3         1, 36, 46         None Higher      .072
 .13          3         2, 4, 40          None Higher      .082
 .14          3         11, 30, 52        None Higher      .092

TABLE 13
SIGNIFICANT DIFFERENCES
BETWEEN CRE MEANS FOR
SIMILARITY MEASURES
CONTROLLED REPRESENTATION

The critical significant difference interval is calculated at
the .05 level with

\[ C = q_{.05} \sqrt{MS_w / n} \]

where

C = the critical significant difference limit
MS_w = the mean square within cells
n = the number of term weighting schemes
q_{.05} = the studentized range statistic,
          estimated at K = 24, N - K = \infty

By observation the single best result was achieved by
the combination of similarity measure 52 with term weighting
scheme 31. However, it is clear that similarity measures
11, 30, 52, 2, 4, 40, 1, 36, 46, 12, 44, 63 and 64 may all be
performing equivalently. From within this collection there is
no reason to expect one to perform better than another. It will
be suggested that in the absence of better information, the
similarity measure which is the most efficient in storage and
computation is currently the most desirable from the above set.
Methods of determining storage and processing efficiency will
be discussed in a later section.



RESULTS USING THE FREE REPRESENTATION

The CRE values for the free representation are
presented in Table 14. Each value in the table is the
result of 55 individual CRE values obtained from the useable
queries.

The analysis of the term weighting schemes is
presented in Table 15.



TABLE 14
MATRIX OF CRE VALUES
FREE REPRESENTATION

[Matrix of mean CRE values. Rows: similarity measure number.
The cell values are illegible in this copy.]



Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Squares     F

AMONG             20         .24096     .0120       2.353
WITHIN           483        2.471       .0051
TOTAL            503        2.714

TABLE 15
ANALYSIS OF VARIANCE RESULTS
TERM WEIGHTING SCHEMES
FREE REPRESENTATION



The analysis indicates a significant difference among the
term weighting schemes at the .01 level. Again using Tukey's
Honestly Significant Difference, the individual means were
examined to determine if significant distinctions can be made
between pairs of mean CRE values.

The significant difference value is

\[ C = q_{.05} \sqrt{MS_w / n} \]

with $q_{.05}$ at K = 21, N - K = 483 equal to 5.05:

\[ C = 5.05 \sqrt{.0051 / 21} = .07869 \]



The mean CRE values range from .043 to .116, so the largest
difference between any two means is .073, which is less than
the significant difference value of .07869. Thus, no
significant difference is found between individual pairs of
term weighting schemes. This situation can arise when linear
combinations of term weighting schemes are significantly
different but the individual pairs of term weighting schemes
are not significantly different.

The analysis of the similarity measures is presented in
Table 16.



Source of     Degrees of    Sum of      Mean
Variation     Freedom       Squares     Squares     F

BETWEEN           23        1.698       .0738       34.9133
WITHIN           480        1.015       .0021
TOTAL            503        2.713

TABLE 16
ANALYSIS OF VARIANCE RESULTS
SIMILARITY MEASURES
FREE REPRESENTATION



The analysis indicates a significant difference among the
similarity measures at the .01 level. Tukey's Honestly
Significant Difference was used to examine the significant
differences. The significant difference value is

\[ C = q_{.05} \sqrt{MS_w / n} \]

with $q_{.05}$ at K = 24, N - K = 480 equal to 5.17:

\[ C = 5.17 \sqrt{.0021 / 24} = .04836 \]
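The significant difference values quoted for the free representation can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name `hsd_critical_difference` is chosen here for illustration. The divisor 24 for the similarity measures is printed in the text. The divisor 21 for the term weighting schemes is an assumption: it is the value that reproduces the reported .07869, since the printed divisor is not clearly legible.

```python
import math


def hsd_critical_difference(q, ms_within, n):
    """Tukey HSD critical difference: q * sqrt(MS_within / n)."""
    return q * math.sqrt(ms_within / n)


# Similarity measures, free representation (values as printed in the text).
c_similarity = hsd_critical_difference(q=5.17, ms_within=0.0021, n=24)

# Term weighting schemes, free representation (n = 21 is an assumption, see above).
c_term_weighting = hsd_critical_difference(q=5.05, ms_within=0.0051, n=21)

print(round(c_similarity, 5), round(c_term_weighting, 5))
```

Two means are treated as distinguishable only when they differ by more than the critical difference for their analysis.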



Table 17 shows the similarity measures in order along
with the definitions of those measures which are within the
Honestly Significant Difference. Since the number of
similarity measures which do not indicate an honest difference
is large (14), there is justification for selecting a
similarity measure from among these top rated measures based
on its ease of calculation and quantity of storage required.



EFFICIENCY CONSIDERATIONS

In order to decide which ranking algorithm(s) to
implement, one would like an indication of the cost to implement
and execute the algorithm in addition to its effectiveness. A
model which analyzes costs is dependent on the specifics of the
computer installation on which the algorithm is to be
implemented. Hardware and/or software can
make significant changes in costs.

The characteristics identified here provide a weak
ordering of the ranking algorithms. Factors which affect costs are
identified and each of the components of the algorithms is
placed in this framework. The characteristics then indicate
the relative cost of the algorithm.

                        Similarity        Significantly    Significantly
          Frequency     Measure           Different From   Different From
          of            Number From       Means            Means
Mean      Occurrence    Table 2           GREATER than     LESS than

-.104         1         29                All Other Means  None Lower
-.057         1         50                -.009            None Lower
-.017         1         Random            .031             -.065
 .019         1         6                 .067             -.029
 .053         2         63, 64            .101             .005
 .069         1         31                .117             .021
 .072         1         27                .12              .024
 .074         1         58                .122             .026
 .075         1         56                .123             .027
 .091         1         52                None Higher      .043
 .096         1         32                None Higher      .048
 .1           1         7                 None Higher      .052
 .101         1         26                None Higher      .053
 .107         1         12                None Higher      .059
 .108         2         44, 46            None Higher      .060
 .116         1         40                None Higher      .068
 .117         1         36                None Higher      .069
 .118         1         2                 None Higher      .070
 .119         1         1                 None Higher      .071
 .12          1         11                None Higher      .072
 .121         1         30                None Higher      .073
 .124         1         4                 None Higher      .076

TABLE 17
SIGNIFICANT DIFFERENCES BETWEEN
CRE MEANS FOR SIMILARITY MEASURES
FREE REPRESENTATION

A major component of ranking algorithms is the cost
associated with the TW, although the SM also affects the cost.
The following will describe the considerations which appear to
be the most important in determining the cost of a ranking
algorithm. Two considerations are paramount: (1) the processing
requirements of the algorithm and (2) the storage costs of the
algorithm. An assumption of this analysis is that the term
weight is calculated and stored only once.

COSTS OF TWs

The processing costs for the term weighting schemes are
largely determined by whether the specific weighting of the TW
requires one or two passes through the data base. Many of the
weighting schemes required two passes. For example, $f_{in}/F_i$
required one pass to calculate $F_i$ (a collection statistic) for
each term and a second pass to determine the weight for each
term. On the other hand, $\log(f_{in} + 1)$ can be calculated on the
first pass because no collection information is required. Of
course, it is possible to trade processing time for storage and
do the operation $f_{in}/F_i$ at retrieval. However, this was not
examined in this study.
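The difference between a one-pass and a two-pass term weighting scheme can be sketched in Python. This is an illustration only, not the SIRE implementation. Each document is assumed to be a dictionary mapping a term to its within-document frequency, and the collection statistic for a term is assumed here to be its total frequency over all documents; the text says only that it is a collection statistic. The function names are hypothetical.

```python
import math
from collections import Counter


def one_pass_weights(docs):
    """log(f + 1) weights: each document is weighted using its own frequencies only."""
    return [{term: math.log(f + 1) for term, f in doc.items()} for doc in docs]


def two_pass_weights(docs):
    """f / F weights: F (assumed total collection frequency) must be known first."""
    collection_frequency = Counter()
    for doc in docs:                      # first pass: accumulate the collection statistic
        collection_frequency.update(doc)
    return [                              # second pass: compute the weight for each term
        {term: f / collection_frequency[term] for term, f in doc.items()}
        for doc in docs
    ]


docs = [{"rank": 2, "term": 1}, {"rank": 2}]
print(one_pass_weights(docs))
print(two_pass_weights(docs))
```

The second function cannot emit any weight until every document has been read once, which is the processing cost the two-pass category reflects.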

The incremental storage costs are determined by the number
of additional storage locations needed to store the actual
weights. The two choices are: (1) a weight for each unique
term in the dictionary, or (2) a weight for each unique term in
the data base. For the free representation, the former
(generally a collection statistic) required 10,000 storage
units, and the weights for each unique term in the data base
required 170,000 storage units.



The null TW requires no additional storage or processing.
That is, the unweighted scheme represents the lower limit of
cost of term weighting schemes. Costs of term weighting
schemes are presented by class with those costing the most
appearing first. Within classes there are variations due to
the complexity of calculation. These are not considered critical
since specific calculations are minor in comparison to
calculations conducted on a data base.

CATEGORY 1 - Two Passes of Data Base Required.
             One Storage Unit per Unique Term
             One Storage Unit per Document

[term weighting formulas illegible in this copy]

CATEGORY 2 - One Pass.
             One Storage Unit/Unique Term - Document

[term weighting formulas illegible in this copy]

CATEGORY 3 - One Pass.
             One Storage Unit/Unique Term - Dictionary

[term weighting formulas illegible in this copy]

COST OF SMs

A major factor in the costs of a similarity measure is
the need by the measure of summary statistics of the document,
such as $\sum x_j^2$. If this is required, then an additional
storage unit is required for each document in the data base.

The use of summary information such as document length
by the SM may have an interaction effect when combined with
certain TWs. In particular, the use of either $1/d_k$ or
$\log(N/d_k)$ with such a similarity measure would alter these
TWs from one-pass to two-pass TWs. The second pass through
the data base is required to calculate the length of the
document for the weighting scheme.

The discussion of costs of ranking algorithms indicates
the observed factors which affect cost. Exact cost data is
installation dependent and thus not generally useful. The
established categories provide a weak ordering of the different
ranking algorithms in terms of processing costs and storage
requirements. The reader should pay particular attention
to the interaction of term weighting and similarity measures
in determining the cost of any particular ranking algorithm.



CONCLUSIONS

This study indicates that many of the ranking algorithms
currently in use or suggested as effective methods for ordering
output are, in fact, equivalent. Further, as one would expect,
the term weighting schemes in the controlled environment are
simply not important. This is evident from the lack of
significance shown by the analysis of variance given that the
unweighted weighting scheme is isolated.

Term weighting in the free text environment is significant.
However, the use of Tukey's Honestly Significant Difference
fails to indicate a significant difference between pairs of
the term weighting schemes.

Similarity measures in both the free and the controlled
environment are significantly different. Classes of measures
which were found to be equivalent have been presented. These
top rated measures are still disappointing. That is, by
observation the best ranking scheme in the controlled
environment had a CRE of .19 and in the free environment the top
scheme had a CRE of .22. In other words, in both instances
the ranking algorithm was able to improve the order in which
documents appeared by about 20%. Thus, 80% of the potential
benefits from a ranking algorithm is not yet realized. These
results do agree in general with those attained by Noreault,
who found a 35% improvement. The data seem to indicate that
the methods of ranking are not using variables which allow
truly effective ranking (at least in a Boolean environment).
This may unfortunately be a major factor in one's ability to
create truly effective systems. Maron (1979) has recently
stated that:

    "Two valued thinking about indexing (and
    retrieval) leads system designers to worry about
    thresholds, cutoff values, and depth of indexing
    in order to insure that the two-valued decisions
    are optimal for the patrons for whom the system
    is designed to serve. But these days with the
    growth of very large files and especially with
    the growth of on-line, interactive document
    retrieval systems, perhaps it is most rational
    to build systems that provide maximum flexibility
    for each patron. This means that designers should
    build systems which rank the documents, relative to
    an input query, by probability (or degree) of
    satisfaction, and set no preestablished cutoff
    thresholds. Instead of binary indexing, we
    recommend the use of weighted indexing and ranking
    the output documents."

His stated goal is ranking the output, but unfortunately
it is not clear from this study that the use of weighted
indexing will provide his desired results. Ongoing analyses
of this ranking data may help to indicate the variables or
variable types that contribute positively to the ranking
process.

The results of this study have raised many questions.
At a very basic level, one needs to understand the data
generated by the study of the overlap among retrieved sets
of documents. The observed data indicate that the specific
documents retrieved by an individual representation are
different from those retrieved by another representation
even though the original information need is identical.
Further, the data show that the overlap in documents retrieved
by different searchers is small. That is, in response to the
same information need, different searchers appear to be
retrieving different documents. This may be an artifact of
this particular study, or it may be a general situation. It
would seem that only additional data will answer this.

In either case, there is a sense of uneasiness associated
with conducting a study which examines specific documents when
there is question about the factors underlying the selection
of the documents. Fortunately, this did not detract from the
methodology used in this study. The effectiveness of the
ranking algorithms measures changes in order after the set is
retrieved.

The data also suggest that the use of frequency
information, whether by document or by collection, is limited in
its current ability to rank documents. The term weighting
schemes and similarity measures were selected to represent
the schemes available in the open literature. Thus, a
significant increase in the ability of a ranking algorithm is
not likely to occur by a calculation which employs some
rearrangement of these frequency variables. Rather, it seems
that new factors will have to be identified to resolve a
significant portion of the 80% benefit not attained by current
algorithms.



The current study does show that documents can be
rearranged to aid the user. The rearrangement will, in
general, move relevant documents toward the beginning of
the list of output documents. The overall effect is
beneficial to the user. Further, if the algorithm is
selected using the data about effectiveness and efficiency,
then the cost to the system can be minimized while giving all
the benefit we know how to provide at this time.



REFERENCES

ARTANDI, S.; WOLF, E.H. 1969. "The Effectiveness of
Automatically Generated Weights and Links". American
Documentation, 20(3):192-202 (1969).

BALL, G.H. 1965. "Data Analysis in the Social Sciences: What
About the Details?". Proceedings of AFIPS Fall Joint
Computer Conference, 1965, 533-559.

BOOKSTEIN, A. 1977. Personal Communication, 1977.

BOOKSTEIN, A.; COOPER, W.S. 1976. "A General Mathematical
Model for Information Retrieval Systems". Library
Quarterly, 1976 April; 46(2):153-167.

CAGAN, C. 1970. "A Highly Associative Document Retrieval
System". Journal of the American Society for Information
Science. 21:330-337 (1970).

CARROLL, J.M.; ROELOFFS, R. 1969. "Computer Selection of
Keywords Using Word-Frequency Analysis". American
Documentation, 20(3):227-233 (1969).

CLEVELAND, D.B. 1976. "An n-Dimensional Retrieval Model".
Journal of the American Society for Information Science,
1976.

CLEVERDON, C.; KEEN, M. 1966. "Factors Determining the
Performance of Indexing Systems", Volume 2. Test Results.
ASLIB, Cranfield Research Project, 1966.

COLE, E.; McGILL, M.J. 1977. "Professional Activities of
Research Scientists Aided by CAN/SDI: An Approach to the
Information Client". Syracuse, NY: School of Information
Studies, Syracuse University; 1977.

COOPER, W.S. 1968. "Expected Search Length: A Single Measure
of Retrieval Effectiveness Based on the Weak Ordering
of Retrieval Systems". American Documentation. 1968
January; 19(1):30-41.

COOPER, W.S. 1970. "On Deriving Design Equations for Information
Retrieval Systems". Journal of the American Society for
Information Science. 21(6):385-395 (1970).

CORMACK, R.M. 1971. "A Review of Classification". Journal of
the Royal Statistical Society, Series A.
134:321-353 (1971).



EDMUNDSON, H.P.; WYLLYS, R.E. 1961. "Automatic Abstracting
and Indexing - Survey and Recommendations". Communications
of the ACM, 4(5):226-234 (1961).

GEBHARDT, F. 1975. "A Simple Probabilistic Model for Relevance
Assessment of Documents". Information Processing and
Management. 1975; 11(1/2):59-65.

HARTER, S. 1975. "A Probabilistic Approach to Automatic
Keyword Indexing: Part I". Journal of the American Society
for Information Science. 26(4):197-206 (1975).

HARTER, S. 1975. "A Probabilistic Approach to Automatic
Keyword Indexing: Part II". Journal of the American Society
for Information Science. 26(5):280-289 (1975).

KATTER, R. 1967. "A Study of Document Representation:
Multidimensional Scaling of Index Terms". SDC Final Report.
August 31, 1967.

KATZER, J. 1971. Syracuse University Psychological Abstracts
Retrieval Service. Final Report. "Large Scale Information
Processing Systems, Section V: Cost-Benefit Analysis".
Syracuse University, School of Library Science, 1971.

KEEN, E.M. 1973. "The Aberystwyth Index Languages Test".
Journal of Documentation. 29(1):1-35 (1973).

KIRK, ROGER E. 1968. "Experimental Design". Brooks/Cole
Publishers Inc., New York (1968).

KUHNS, J.L. 1965. "The Continuum of Coefficients of Association".
In: Stevens, M.E., Giuliano, V.E. and Heilprin, L.B. (eds)
Statistical Association Methods for Mechanized Documentation,
Symposium Proceedings, Washington, 1964 (NBS Misc. Publ.
No. 269, 1965), 33-40.

LANCASTER, F.W.; FAYEN, E.G. 1973. "Information Retrieval
On-line". Los Angeles, CA: Melville Publishing Co.;
1973. 414.

LERMAN, I.C. 1970. "Les Bases de la Classification Automatique".
Paris: Gauthier-Villars, 1970.

LUHN, H.P. 1957. "A Statistical Approach to Mechanized Encoding
and Searching of Literary Information". IBM Journal of
Research and Development, 1(4):309-317 (1957).

MARON, M.E. 1979. "Depth of Indexing". Journal of the American
Society for Information Science. Volume 30, Number 4.
July 1979, p. 227.



ERIC 



108 

MARON, ME.; KUHNS, J.L. 1960. "On Relevance Probabilistic 
Indexing and Information Retrieval", Journal of the 
. ACM. T(3):216-244 (1960). ^ i^e 

\McCAl?N, D. 1976. Personal Communication. 

McGILL, M.J,; SMITH, L.C.; DAVIDSON, S.; NOREAULT, T, 4.976?. 
•'Syracuse Information Retrieval Eacperiment (SIRE) $ 
Design of lan On-line Bibliographic Retrieval System**. 
SIGIR PORUM of the ACM, X(4) s37-44 (1976) . 

MINKER, J.; WILSON, G.A.| ZIMMERMAN, B.H. 1972. "An Evaluation 
of Query Expansion by the^ Addition of Clustered Terms for a 
Document Retrieval Sj^stam". Information Storage and 
Retrieval. 8:329-348, (1972). 

NOREAULT, T.; KOLL, M. ; McGILL, M.J. "Automatic Ranked Output 
from Boolean Searches in SIRE". JASJS 1977, 28:333-339. 

ORLOCCI, L. 1969. '•information Theory Models for Hierarchic - 
and Non-Hierarchic Classifications'*. In: Cole, A.J. (ed) 
Numerical Taxonomy, Proceedings ^ of Colloquium in Numerical 
Taxonomy at University of St. Andrews, Scotland, 1968. 
NY: Academic Press, 1969 # 148-164. 

OVERALL, J.E.; KLETT, C.J. 1972. "Applied Multivariate Analysis" 
Englewood Cliffs: Prentice Hall, 1972. 

REITSMA, K.; SAGALYN, J. 1968. "Correlation Measures" . in: 
Info nnat ion Storage and Retrieval. ISR Report No. 13. 

RICKMAN, J. T.- 1972. "Automatic Storage and Retrieval for On-Line 
Abstract Collections". Doctoral Dissertation, Pullman WA: 
Department of Computer Science, Washington, State University; 

ROBKRTSON, s.E. 1974. "Specificity and Weighted Retrieval". 
Journal of Documentation. 30(l):41-46 (1974). 

SAGER, W.K.H.; LOCKEMANN, P.C. 1976. "Classification of 
Ranking Algorithms". International Forum on Information 
and Documentation. 1(4):2-25 (1976). 

SALTON, G.; LESK, M.E. 1968. "Computer Evaluation of 
Indexing and Text Processing". Journal of the ACM. 
15(5):9-36 (1968). 

SALTON, G. 1968. "Automatic Information Organization and 
Retrieval". New York: McGraw-Hill Book Company, 1968. 






SALTON, G. 1969. "A Comparison Between Manual and Automatic 
Indexing Methods". American Documentation. 20(1):61-71 
(1969). 

SALTON, G.; YANG, C.S. 1973. "On the Specification of Term 
Values in Automatic Indexing". Journal of Documentation. 
29(4):351-372 (1973). 

SALTON, G. 1975. "Dynamic Information and Library Processing". 
Prentice-Hall Inc. Englewood Cliffs, NJ, pp. 504. 

SALTON, G.; YANG, C.S.; YU, C.T. 1975. "A Theory of Term 
Importance in Automatic Text Analysis". Journal of the 
American Society for Information Science, 26(1):33-44. 

SALTON, G.; WONG, A.; YU, C.T. 1976. "Automatic Indexing 
Using Term Discrimination and Term Precision Measurements". 
Information Processing and Management. 12:43-51 (1976). 

SARACEVIC, T. 1968. "An Inquiry into Testing of Information 
Retrieval Systems". Comparative Systems Laboratory 
Technical Reports No. CSL:TR-FINAL, 1 to 3. Cleveland: 
Center for Documentation and Communication Research, 
Case Western Reserve University, 1968. 


SNEATH, P.H.A.; SOKAL, R.R. 1973. "Numerical Taxonomy: The 
Principles and Practice of Numerical Classification". 
San Francisco, CA: W.H. Freeman and Co., 1973. 

SPARCK JONES, K.; JACKSON, D.M. 1970. "The Use of 
Automatically-Obtained Keyword Classifications for Information 
Retrieval". Information Storage and Retrieval. 
5(4):175-202 (1970). 

SPARCK JONES, K. 1972. "A Statistical Interpretation of Term 
Specificity and Its Application in Retrieval". Journal of 
Documentation, 28:11-21 (1972). 

SPARCK JONES, K. 1973. "Index Term Weighting". Information 
Storage and Retrieval. 9:619-633 (1973). 

SPARCK JONES, K. 1974. "Automatic Indexing: A State-of-the-Art 
Review". Computer Laboratory, University of Cambridge, 



SVENONIOUS, E. 1972. "An Experiment in Index Term Frequency". 
Journal of the American Society for Information Science. 
23:109-121 (1972). 

SWETS, J.A. 1967. "Effectiveness of Information Retrieval 
Methods". Bolt Beranek and Newman Rept. 1499. Cambridge, 
Mass. April 1967. 






SWITZER, P. 1964. "Vector Images in Information Retrieval". 
In Statistical Association Methods for Mechanical 
Documentation, Symposium Proceedings, Washington, D.C. 
1964. (NBS Misc. Publ. 269, 1965). pp. 163-171. 

TAGUE, J. "An Evaluation of Statistical Association Measures". 
Proceedings of American Documentation, 1966, 391-397. 

TARS, A. 1976. "Stemming As A System Design Consideration". 
ACM SIGIR FORUM, Spring 1976, pp. 9-16. 

TORGERSON, W.S. 1958. "Theory and Methods of Scaling". NY: 
John Wiley and Sons, Inc. 1958. 

van RIJSBERGEN, C.J. 1975. "Information Retrieval". London: 
Butterworths, 1975. 

van RIJSBERGEN, C.J. 1977. "A Theoretical Basis for the Use 
of Co-occurrence Data in Information Retrieval". 
Journal of Documentation, 33(2):106-119 (1977). 

YU, C.T.; SALTON, G. 1977. "Effective Information Retrieval 
Using Term Accuracy". Communications of the ACM, 
20(3):135-142 (1977). 






APPENDIX A 






NSF RANKING PROJECT 
Instructions to Searchers 



1. Make sure first digit of each Information Requirement Statement (IRS) 
   I.D. number is yours. 

2. Process the IRSs in the order given to you. 

3. a. IRSs marked ERICF (having ID numbers ending in 0) are to be 
      searched against the ERICF file. This is a dictionary of 
      stems from document titles and abstracts. 

   b. IRSs marked ERICC (having ID numbers ending in 1) are to be 
      searched against the ERICC file. This is an indexer-assigned, 
      controlled vocabulary. 

   Listings of both dictionaries are available in hard copy. 

4. a. Search as you would under normal working conditions. 

   b. Try to provide the user with high recall. That is, 
      lean towards inclusion, rather than exclusion, of 
      possibly relevant documents. 

   c. You may occasionally process the same request twice, 
      once against ERICF and once against ERICC. Treat these 
      as independent requests. Treat the second IRS as if you 
      had not read it before. 

5. Use the IRS forms as worksheets. 

6. Refer to SIRE Instruction Sheet for aid using SIRE. 

7. When satisfied (or at least done) with the set of documents 
   retrieved for an IRS, issue the "DONE" command with the full 
   IRS ID number. 

8. When done, rip off paper from your terminal and insert in folder. 

9. When a terminal session is through, return IRS forms for 
   completed searches, along with all terminal output in file 
   folders. 
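
The ID conventions in these instructions (first digit identifies the searcher, last digit selects the file) can be illustrated with a short sketch. It assumes the six-digit N|NNNN|I layout given in the tracking instructions later in this appendix; the function name `parse_irs_id` and the returned field names are hypothetical and are not part of SIRE.

```python
# Illustrative sketch only: parse_irs_id and its field names are hypothetical.
# Assumed ID layout (from the tracking instructions): N | NNNN | I
#   N    = searcher digit
#   NNNN = accession number
#   I    = 0 for the ERICF file, 1 for the ERICC file

FILE_BY_SUFFIX = {"0": "ERICF", "1": "ERICC"}


def parse_irs_id(irs_id: str) -> dict:
    """Split a full IRS ID into searcher digit, accession number and file."""
    if len(irs_id) != 6 or not irs_id.isdigit():
        raise ValueError(f"expected six digits, got {irs_id!r}")
    suffix = irs_id[-1]
    if suffix not in FILE_BY_SUFFIX:
        raise ValueError(f"last digit must be 0 or 1, got {suffix!r}")
    return {
        "searcher": int(irs_id[0]),
        "accession": irs_id[1:5],
        "file": FILE_BY_SUFFIX[suffix],
    }


if __name__ == "__main__":
    print(parse_irs_id("200010"))
```

A searcher could use such a check to confirm that a folder belongs to them and which dictionary to search before starting.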






NSF RANKING PROJECT 
SIRE Instruction Sheet 



Boolean query 

Submits a Boolean logic query. An "and", "or" or "not" must appear 
between each search term. The Boolean operators are processed 
left to right. 
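
Strict left-to-right processing means the operators have no precedence over one another, which can change the result compared with conventional Boolean evaluation. The sketch below illustrates this on an in-memory postings table. The function name, the whitespace-separated query syntax, and the reading of "not" as set difference are assumptions made for illustration, not a description of SIRE's internals.

```python
# Illustrative sketch only: not SIRE's implementation.
# Assumptions: terms and operators are separated by spaces, each term maps to
# a set of document numbers, and "not" removes the next term's documents from
# the running result.


def evaluate_left_to_right(query: str, postings: dict[str, set[int]]) -> set[int]:
    """Evaluate 'term op term op term ...' strictly left to right."""
    tokens = query.split()
    if len(tokens) % 2 == 0:
        raise ValueError("query must alternate term, operator, term, ...")
    result = set(postings.get(tokens[0], set()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        docs = postings.get(term, set())  # unknown term: 0 postings
        op = op.lower()
        if op == "and":
            result &= docs
        elif op == "or":
            result |= docs
        elif op == "not":
            result -= docs
        else:
            raise ValueError(f"unknown operator {op!r}")
    return result


if __name__ == "__main__":
    postings = {"reading": {1, 2, 3}, "adult": {3, 4}, "test": {2}}
    # Left to right: (reading or adult) and test
    print(evaluate_left_to_right("reading or adult and test", postings))
```

With these sample postings the left-to-right result is document 2 only, whereas giving "and" precedence over "or" would return documents 1, 2 and 3, so searchers need to order their terms with this in mind.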

Save N 

Saves the results of previous search in location N. N is an integer 
value between 1 and 5 inclusive. A saved set may be used as a 
term in a later search by using *N in the query. 

List 

Lists the document numbers retrieved by the previous search. 



TA N 

"Type Abstract" - Types complete bibliographic citation plus the 
abstract and descriptors for the Nth ranked document. N may also 
be a range of ranked documents, e.g. "TA 1-5" types the first five 
documents from the retrieved set. 

"Type Short" - Same as TA except abstract and descriptors are 
not printed. 

DONE Query Number 

When finishing a search, issue this command before next search. 

Ends SIRE execution. 

Switch file name 

ERICF - for title and abstract 
ERICC - for controlled vocabulary 

In the controlled vocabulary, the words in a single search term are 
separated using a "/". If the word is not in the dictionary, it will be 
echoed with 0 postings. If it is in the dictionary, the code for the 
word will be echoed. 






Instructions for Tracking Queries 

Fill in the following chart as appropriate: 







N            Date IRS   Date Out      Date In         Date Out   Date In 
             Rccvd      To Searcher   From Searcher   To User    From User 

1.  0001 0 
2.  0001 1 
3.  0002 0 
4.  0002 1 
. 
. 













When Information Requirement Forms (IRS) are received from user, assign 
each an accession number (NNNN). 

Make 2 copies of each IRS. 

File master by accession number. 

Label one copy "ERICF" and add "0" to end of accession number. Label other 
copy "ERICC" and add "1" to end of accession number. 

For each copy, individually, assign a random number to the front of the 
accession number. Roll die: 

     IF      ASSIGN 
     1, 2    1 
     3, 4    2 
     5, 6    3 

Create a file folder for each copy, labelled N|NNNN|I. Store separately. 
Give folder to appropriate searcher. 
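
The labelling procedure above can be sketched in a few lines. This is an illustration only: the project used a physical die, so the pseudo-random generator stands in for it, and the names `searcher_for_roll` and `make_folder_ids` are hypothetical.

```python
# Illustrative sketch only: a seeded pseudo-random generator stands in for the
# physical die, and all names below are hypothetical.
import random

FILE_SUFFIX = {"ERICF": "0", "ERICC": "1"}


def searcher_for_roll(roll: int) -> int:
    """Map a die roll to a searcher digit: 1,2 -> 1; 3,4 -> 2; 5,6 -> 3."""
    if not 1 <= roll <= 6:
        raise ValueError("a die roll must be between 1 and 6")
    return (roll + 1) // 2


def make_folder_ids(accession: int, rng: random.Random) -> dict[str, str]:
    """Build one folder label per copy, rolling separately for each copy."""
    ids = {}
    for file_name, suffix in FILE_SUFFIX.items():
        roll = rng.randint(1, 6)  # one independent roll per copy
        ids[file_name] = f"{searcher_for_roll(roll)}{accession:04d}{suffix}"
    return ids


if __name__ == "__main__":
    print(make_folder_ids(1, random.Random()))
```

Because each copy gets its own roll, the two versions of one request may go to the same searcher or to different searchers, consistent with the note in the searcher instructions that a request may occasionally be processed twice by one person.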

When searcher returns folder, prepare printed output for return to user. 
(Cut, burst, staple and assemble with original IRS). 

When relevance assessments are returned from user, add to the NNNN 0 folder. 



SYRACUSE UNIVERSITY 

SCHOOL OF INFORMATION STUDIES 



We will conduct a computer search of four Computer Index Journals 
in Education data bases for you if you will simply tell us what it is you 
would like us to search for and tell us how we did after the search. You 
will have access to the data bases created by the clearinghouses on 
[illegible]. These data bases are from 1975 
through items made available less than a month ago. 

The attached form is for you to describe the topic of interest. 
Don't worry about trying to say it in computerese. You say it in 
English. We have trained people to make sure that your search is 
conducted professionally. In about a week you will receive a list of 
references and abstracts found on your topic. You will then be asked to 
indicate which of these are pertinent to your interests. That is all 
there is to it. You keep one copy of the computer output, and return 
one copy to us telling us which references are pertinent. 



Address: 



Instructions to Participants 

[illegible] Education, you need information on right 
[illegible] planning a talk you probably have a topic in 
[illegible] working on, consider one you are 
familiar with. 

[illegible] information for your topic we want you to write down 
[illegible] as if you were talking to a colleague who 
understood the field as well as you do. 

[illegible] as concise as possible. This statement should 
[illegible] knowledge of education would, on the 
[illegible] be able to pick out sources which would be of 
interest to you. 

In 2 - 4 sentences describe the Information you want: 



In a short time you will receive a list of references and abstracts that have 
been [illegible] a data base consisting of document references 
from Computer Index Journals in Education. You will be asked at that time to 
let us know which of these references you think would be pertinent to your 
interest. Thank you for your cooperation. 

NSF Information Retrieval Project 
School of Information Studies 
113 Euclid Avenue 
Syracuse, New York 13210 
(315) 423-4522 




NSF INFORMATION RETRIEVAL PROJECT 



INSTRUCTIONS TO PARTICIPANTS 

of a list of references. Each reference consists of seven parts: 

DN - Document identification number 
TI - Title 
AU - Author 
SO - The source of the reference (e.g. the title of the journal in 
     which the article appeared) 
AB - Abstract 
DT - Date 
DB - Descriptors of the reference 

(b) [illegible] 

From each citation and abstract you form an idea of what that particular 
document (book, article, report) is about. Compare this to your [illegible] 
related to your topic. Based on the information in front of you, is the 
document relevant to your topic or not relevant to what you had in mind? 

Judge on a scale from 1 to 4: 

1 - Definitely relevant to your topic. 

2 - Probably relevant to your topic. 

3 - Probably not relevant to your topic. 

4 - Definitely not relevant to your topic. 

Place the number in the box next to each reference. 
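
As an illustration of how returned judgments on this four-point scale might be checked and summarized, a minimal sketch follows. The function name and the input format (document number mapped to the number written in the box) are assumptions; the form itself does not say how the judgments were tabulated.

```python
# Illustrative sketch only: tally_judgments and its input format are
# hypothetical; only the four scale labels come from the form above.
from collections import Counter

SCALE = {
    1: "Definitely relevant",
    2: "Probably relevant",
    3: "Probably not relevant",
    4: "Definitely not relevant",
}


def tally_judgments(judgments: dict[str, int]) -> Counter:
    """Validate each judgment against the 1-4 scale and count per category."""
    for doc_number, score in judgments.items():
        if score not in SCALE:
            raise ValueError(f"document {doc_number}: {score!r} is not on the 1-4 scale")
    return Counter(SCALE[score] for score in judgments.values())


if __name__ == "__main__":
    returned = {"EJ100001": 1, "EJ100002": 4, "EJ100003": 1, "EJ100004": 2}
    for label, count in tally_judgments(returned).items():
        print(f"{label}: {count}")
```

The document numbers in the usage example are made up for demonstration.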






SCHOOL OF INFORMATION STUDIES SYRACUSE UNIVERSITY 

Date 



COMPUTER SEARCH 



Recently the NSF Information Retrieval Project 
prepared a computer search for you. 

Part B was for your reference and Part A was to 
have a relevance judgment and be returned to us. As 
yet we have not received Part A from you. 

It would be appreciated if you would complete 
the judgment and return Part A to 

NSF Information Retrieval Project 
School of Information Studies 
113 Euclid Avenue 
Syracuse, New York 13210 
(315) 423-4522 

Thank you for your prompt attention to this 
request. 



Michael J. McGill 



